Tau-Bench is a benchmark for evaluating tool-calling agents in complex, realistic scenarios, specifically retail and airline customer-service environments. Researchers and developers can test their agents against configurable user simulators, choosing both the simulator model (e.g., gpt-4o or Claude) and the simulation strategy (e.g., react, verify, or reflection). A key feature is the auto error identification tool, which pinpoints where in an agent trajectory a failure occurred, substantially reducing manual debugging effort. The tool uses LLMs to locate the fault and attach a labeled error description to each failed trajectory.
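A minimal sketch of launching an evaluation run with a chosen agent model, user-simulator model, and simulation strategy. It assumes the repository exposes a Python entry point along the lines of a `RunConfig` passed to `run()`; the exact field names and defaults are assumptions here, so check the repository's `run.py` and `tau_bench` package for the actual interface.

```python
# Illustrative configuration; field names are assumptions based on a typical
# tau-bench setup and may differ from the repository's actual RunConfig.
from tau_bench.types import RunConfig
from tau_bench.run import run

config = RunConfig(
    env="retail",                  # or "airline"
    model="gpt-4o",                # agent model under evaluation
    model_provider="openai",
    user_model="gpt-4o",           # user-simulator model
    user_model_provider="openai",
    user_strategy="react",         # e.g. "llm", "react", "verify", "reflection"
    agent_strategy="tool-calling",
    max_concurrency=8,             # number of tasks run in parallel
)

results = run(config)              # per-task results and trajectories
```

The same knobs are what the auto error identification tool consumes afterwards: it takes the saved results of a run like this and emits labeled descriptions of where each failed trajectory went wrong.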
Tau-Bench is flexible about which tasks are executed, which models are used, and which simulation strategies are applied. It also ships historical trajectories for both the airline and retail environments, which make initial testing and validation faster; a small inspection sketch follows below. The codebase is designed to be extensible and type-safe, and community contributions that expand its capabilities and applicability are welcome. It is released under the MIT license, permitting modification and reuse.
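A short sketch of using the bundled historical trajectories for a quick sanity check before running your own agent. The file path and JSON schema here are illustrative assumptions, not the repository's documented format; adjust them to whatever the shipped trajectory files actually contain.

```python
# Inspect a historical run and report its success rate.
# Path and record fields ("task_id", "reward") are hypothetical placeholders.
import json
from pathlib import Path

path = Path("historical_trajectories/retail/example_run.json")  # hypothetical file name
with path.open() as f:
    records = json.load(f)

# Assuming each record carries a scalar reward, with 1.0 meaning the task was solved.
solved = sum(1 for r in records if r.get("reward", 0) >= 1.0)
print(f"{solved}/{len(records)} tasks solved in this historical run")
```

Checking a known trajectory set like this confirms that the environment and scoring pipeline are wired up correctly before spending compute on fresh agent runs.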