Agent Evaluation

Agent evaluation is the systematic measurement of an AI agent's ability to complete tasks through multiple decisions, model calls, and tool interactions. It extends conventional evaluation beyond the quality of a single generated response.

An agent may produce a plausible final answer despite using an unsafe or inefficient process. It may also take reasonable intermediate actions but fail because of an external dependency. Agent evaluation therefore examines both outcomes and execution trajectories.

Important dimensions include:

Task success: whether the requested outcome was actually achieved.
Correctness: whether outputs and external state changes are accurate.
Tool use: whether the agent selected appropriate tools and supplied valid arguments.
Efficiency: how many model tokens, tool calls, and retries a task consumes, and at what latency and monetary cost.
Robustness: performance under ambiguous instructions, tool failures, and interface changes.
Safety: whether the agent avoids unauthorized actions and sensitive-data exposure, and resists prompt injection attacks.
Human intervention: how often approval, correction, or recovery is required.
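
One way to make these dimensions concrete is to record them per run and aggregate across runs. The sketch below is illustrative only: the RunRecord schema, its field names, and the summarize helper are assumptions made for this example, not a standard format.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class RunRecord:
    """One agent run scored along the dimensions above (hypothetical schema)."""
    task_id: str
    succeeded: bool
    tool_calls: int
    invalid_tool_calls: int
    tokens: int
    latency_s: float
    safety_violations: int
    human_interventions: int


def summarize(records: list[RunRecord]) -> dict[str, float]:
    """Aggregate per-run records into a small scorecard."""
    if not records:
        raise ValueError("no records to summarize")
    total_calls = sum(r.tool_calls for r in records)
    invalid_calls = sum(r.invalid_tool_calls for r in records)
    return {
        "success_rate": mean(1.0 if r.succeeded else 0.0 for r in records),
        "invalid_tool_call_rate": invalid_calls / total_calls if total_calls else 0.0,
        "mean_tokens": mean(r.tokens for r in records),
        "mean_latency_s": mean(r.latency_s for r in records),
        "runs_with_safety_violation": sum(1 for r in records if r.safety_violations > 0),
        "intervention_rate": mean(1.0 if r.human_interventions > 0 else 0.0 for r in records),
    }
```

Keeping the dimensions as separate fields, rather than collapsing them into one score, makes it easier to see when a change improves task success at the expense of cost or safety.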

Agent evaluations should run in controlled environments with resettable state. Each task needs an initial state, success criteria, allowed actions, time or step limits, and a deterministic method for inspecting the final environment where possible. For example, a coding-agent task can be graded by tests and repository state, while a customer-support task may be checked against database changes and policy constraints.
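
The task ingredients described above can be sketched as a small harness. Everything here is a toy assumption for illustration: the Task fields, the run_task loop, the in-memory order "database", and the scripted agent stand in for a real environment and a real model-driven agent.

```python
import copy
from dataclasses import dataclass
from typing import Callable

State = dict
Action = tuple[str, dict]  # (action name, arguments)


@dataclass
class Task:
    initial_state: State
    allowed_actions: set[str]
    max_steps: int
    check: Callable[[State], bool]  # deterministic check on the final state


def run_task(task, agent, apply_action):
    """Run one trial on a fresh copy of the initial state and grade the result."""
    state = copy.deepcopy(task.initial_state)  # resettable: every trial starts clean
    steps = 0
    while steps < task.max_steps:
        action = agent(state)
        if action is None:  # the agent signals that it is done
            break
        name, args = action
        if name not in task.allowed_actions:
            return {"success": False, "steps": steps, "reason": f"disallowed action: {name}"}
        apply_action(state, name, args)
        steps += 1
    # Grade by inspecting the environment, not by trusting the agent's own report.
    return {"success": task.check(state), "steps": steps, "reason": "final state checked"}


# Toy customer-support environment (hypothetical).
def apply_action(state, name, args):
    if name == "refund":
        state["orders"][args["order_id"]]["status"] = "refunded"
    elif name == "delete_order":
        del state["orders"][args["order_id"]]


refund_task = Task(
    initial_state={"orders": {"A1": {"status": "paid", "amount": 40}}},
    allowed_actions={"refund"},
    max_steps=5,
    check=lambda s: s["orders"].get("A1", {}).get("status") == "refunded",
)


def scripted_agent(state):
    """Stand-in for a real agent: refund order A1 once, then stop."""
    order = state["orders"].get("A1")
    if order and order["status"] == "paid":
        return ("refund", {"order_id": "A1"})
    return None
```

Calling run_task(refund_task, scripted_agent, apply_action) grades the trial from the final order record, and because the harness works on a deep copy, the task's initial state is left untouched for the next trial.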

Because model behavior is stochastic, one run is rarely sufficient. Repeated trials can reveal variance, brittle strategies, and rare high-impact failures. Evaluation datasets should include normal cases, edge cases, adversarial inputs, and regression cases collected from production incidents.
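
To illustrate why a single run is rarely enough, the sketch below computes a Wilson score interval for a success rate estimated from repeated trials and flags tasks whose outcome differs between trials. The function names and the 95% default (z = 1.96) are choices made for this example.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a success rate (z = 1.96 is roughly 95% confidence)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return (max(0.0, center - half), min(1.0, center + half))


def flaky_tasks(outcomes: dict[str, list[bool]]) -> list[str]:
    """Tasks that both passed and failed across repeated trials."""
    return sorted(t for t, runs in outcomes.items() if any(runs) and not all(runs))
```

With 8 successes in 10 trials the interval spans roughly 0.49 to 0.94, which shows how little a handful of runs pins down the true success rate.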

Trajectory-level analysis is valuable for debugging agentic workflows. Agent traces can show incorrect routing, unnecessary loops, context loss, bad tool selection, and failed recovery. However, inspecting a model's stated reasoning is not a substitute for verifying actions and outcomes.
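
Two of these trace checks can be sketched over a simplified trace. The ToolCall record and both heuristics are assumptions for illustration; real traces usually carry more detail, such as timestamps, model messages, and tool outputs.

```python
import json
from dataclasses import dataclass


@dataclass
class ToolCall:
    tool: str
    args: dict
    ok: bool  # whether the tool call succeeded


def _key(call: ToolCall) -> tuple[str, str]:
    # Serialize arguments with sorted keys so equal arguments compare equal.
    return (call.tool, json.dumps(call.args, sort_keys=True))


def find_loops(trace: list[ToolCall], min_repeats: int = 3) -> list[tuple[int, str]]:
    """Return (start index, tool) for runs of consecutive identical calls."""
    loops = []
    i = 0
    while i < len(trace):
        j = i + 1
        while j < len(trace) and _key(trace[j]) == _key(trace[i]):
            j += 1
        if j - i >= min_repeats:
            loops.append((i, trace[i].tool))
        i = j
    return loops


def unrecovered_failures(trace: list[ToolCall]) -> list[int]:
    """Indices of failed calls never followed by a successful call to the same tool."""
    return [
        i for i, call in enumerate(trace)
        if not call.ok
        and not any(later.tool == call.tool and later.ok for later in trace[i + 1:])
    ]
```

Both functions look only at recorded actions and their results, in keeping with the point that stated reasoning does not replace verification of what the agent actually did.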

Automated graders, including LLM-as-a-Judge, can scale qualitative assessment but should be calibrated against human judgments and deterministic checks. High-stakes criteria should not depend solely on another model's opinion.
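
One common way to check calibration is to measure chance-corrected agreement between judge labels and human labels on the same items. The sketch below uses Cohen's kappa; the function name and the label values are illustrative, and a real calibration set would be far larger than a few items.

```python
from collections import Counter


def cohens_kappa(judge: list[str], human: list[str]) -> float:
    """Chance-corrected agreement between two label sequences of equal length."""
    if len(judge) != len(human) or not judge:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(judge)
    observed = sum(j == h for j, h in zip(judge, human)) / n
    judge_counts, human_counts = Counter(judge), Counter(human)
    labels = set(judge_counts) | set(human_counts)
    # Agreement expected if both raters labeled independently at their own base rates.
    expected = sum(judge_counts[label] * human_counts[label] for label in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)
```

Raw agreement can look high when one label dominates, so a chance-corrected score gives a more honest picture of whether the judge tracks human judgment.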

Agent evaluation is most useful as part of evaluation-driven development: define representative tasks, establish baselines, inspect failures, change prompts or orchestration, and rerun the same suite before deployment.
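
The "rerun the same suite" step can be sketched as a comparison between per-task success rates from a baseline run and a candidate run. The compare_runs function, the task identifiers, and the 0.05 tolerance are assumptions for this example; an appropriate tolerance depends on how many trials back each rate.

```python
def compare_runs(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 0.05,
) -> dict[str, list[str]]:
    """Compare per-task success rates from two runs of the same suite."""
    regressions, improvements = [], []
    for task_id in sorted(baseline.keys() & candidate.keys()):
        delta = candidate[task_id] - baseline[task_id]
        if delta < -tolerance:
            regressions.append(task_id)
        elif delta > tolerance:
            improvements.append(task_id)
    return {
        "regressions": regressions,
        "improvements": improvements,
        # Tasks that silently dropped out of the suite deserve attention too.
        "missing": sorted(baseline.keys() - candidate.keys()),
    }
```

A deployment gate might, for instance, block a change whenever the regressions or missing lists are non-empty, even if the average success rate went up.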

