AgentBench

[Screenshot of AgentBench]

A Comprehensive Benchmark to Evaluate LLMs as Agents

AgentBench is a comprehensive benchmark designed to rigorously evaluate the capabilities of Large Language Models (LLMs) as autonomous agents across a diverse range of environments. The framework spans eight distinct tasks: operating system interaction, database analysis, knowledge graph navigation, digital card game playing, lateral thinking puzzles, embodied household tasks, online shopping, and web browsing. By simulating real-world scenarios, AgentBench provides a robust assessment of an LLM’s ability to understand complex instructions, plan strategically, and interact with varied interfaces to achieve specific goals. The benchmark reports detailed performance analysis, including success rates, win rates, and game progress metrics.
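
For illustration only (the task names and episode data below are made up, not AgentBench output), a short Python sketch shows how such per-task numbers reduce to simple aggregates:

```python
# Hypothetical per-task results; real AgentBench runs produce structured logs,
# but the headline numbers boil down to aggregates like these.
episode_success = {
    "operating_system": [True, False, True, True],   # did the agent solve the task?
    "online_shopping": [True, True, False, False],
}
game_progress = {"digital_card_game": [0.8, 0.5, 1.0]}  # fraction of the game completed

for task, outcomes in episode_success.items():
    rate = sum(outcomes) / len(outcomes)
    print(f"{task}: success rate = {rate:.2%}")

for task, progress in game_progress.items():
    avg = sum(progress) / len(progress)
    print(f"{task}: average progress = {avg:.2f}")
```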

AgentBench provides a flexible, modular evaluation framework. Its design decouples core components, such as the Task Server, Agent Server, and Client, so each can be developed and deployed independently. Configuration is handled through YAML files with extended keywords, making setups easy to customize and reuse. With clear documentation and automated scripts, AgentBench gives researchers and developers a practical way to thoroughly test and improve the real-world usability of their LLMs, bridging the gap between theoretical performance and the practical application of AI agents.
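
As a rough sketch of what this decoupling looks like in practice (hypothetical endpoint and function names, not AgentBench's actual code), an agent server can be as small as a stand-alone HTTP service that maps task observations to actions, leaving the task server and client free to evolve separately:

```python
# Minimal illustrative agent server: one POST endpoint, observation in, action out.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def choose_action(observation: str) -> str:
    """Placeholder policy; a real agent server would call an LLM here."""
    return f"echo: {observation}"


class AgentHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        action = choose_action(payload.get("observation", ""))
        body = json.dumps({"action": action}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


if __name__ == "__main__":
    # A task worker or client would POST observations to this endpoint.
    HTTPServer(("127.0.0.1", 8000), AgentHandler).serve_forever()
```

Such a service can be exercised with a plain POST, e.g. `curl -X POST localhost:8000 -d '{"observation": "ls /tmp"}'`.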

https://github.com/THUDM/AgentBench
