Tau-Bench is a benchmark for evaluating tool-calling agents in complex, realistic scenarios, specifically retail and airline customer-service environments. Researchers and developers can test their agents against configurable user simulators, choosing both the simulator model (e.g., gpt-4o or Claude) and the simulation strategy (e.g., react, verify, or reflection). A key feature is the auto error identification tool, which pinpoints where in an agent trajectory a failure occurred, substantially reducing manual debugging effort. The tool uses LLMs to locate the fault and attach a labeled error description to each failed trajectory.
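A minimal sketch of launching an evaluation run with a chosen agent model, user-simulator model, and simulation strategy. It assumes the repository exposes a Python entry point along the lines of a `RunConfig` passed to `run()`; the exact field names and defaults are assumptions here, so check the repository's `run.py` and `tau_bench` package for the actual interface.

```python
# Illustrative configuration; field names are assumptions based on a typical
# tau-bench setup and may differ from the repository's actual RunConfig.
from tau_bench.types import RunConfig
from tau_bench.run import run

config = RunConfig(
    env="retail",                  # or "airline"
    model="gpt-4o",                # agent model under evaluation
    model_provider="openai",
    user_model="gpt-4o",           # user-simulator model
    user_model_provider="openai",
    user_strategy="react",         # e.g. "llm", "react", "verify", "reflection"
    agent_strategy="tool-calling",
    max_concurrency=8,             # number of tasks run in parallel
)

results = run(config)              # per-task results and trajectories
```

The same knobs are what the auto error identification tool consumes afterwards: it takes the saved results of a run like this and emits labeled descriptions of where each failed trajectory went wrong.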
Tau-Bench is flexible about which tasks are executed, which models are used, and which simulation strategies are applied. It also ships historical trajectories for both the airline and retail environments, which make initial testing and validation faster; a small inspection sketch follows below. The codebase is designed to be extensible and type-safe, and community contributions that expand its capabilities and applicability are welcome. It is released under the MIT license, permitting modification and reuse.
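A short sketch of using the bundled historical trajectories for a quick sanity check before running your own agent. The file path and JSON schema here are illustrative assumptions, not the repository's documented format; adjust them to whatever the shipped trajectory files actually contain.

```python
# Inspect a historical run and report its success rate.
# Path and record fields ("task_id", "reward") are hypothetical placeholders.
import json
from pathlib import Path

path = Path("historical_trajectories/retail/example_run.json")  # hypothetical file name
with path.open() as f:
    records = json.load(f)

# Assuming each record carries a scalar reward, with 1.0 meaning the task was solved.
solved = sum(1 for r in records if r.get("reward", 0) >= 1.0)
print(f"{solved}/{len(records)} tasks solved in this historical run")
```

Checking a known trajectory set like this confirms that the environment and scoring pipeline are wired up correctly before spending compute on fresh agent runs.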