Evaluating agents
This blog post is a companion guide to Anthropic's research on building effective agents. It requires prior knowledge of agentic systems.
Building an agentic system, whether it’s a simple augmented large language model (LLM) or a fully autonomous agent, involves many moving parts. Your design might make sense on paper, but without measurement and iterative improvement, unexpected issues can pop up in production. Inspired by Anthropic’s guide to building effective agents, in this post, we’ll walk through practical strategies for evaluating the quality and accuracy of agentic systems.
Why run evaluations?
LLM-based agents:
- Can have unpredictable error modes (hallucinations, repetitive loops, etc.). For example, a financial advisory chatbot might hallucinate stock market predictions based on fictional data, potentially misleading users.
- Can rely on context, memory, or retrieval tools that may have their own failures. For example, a legal research assistant might misinterpret a poorly retrieved document and provide incorrect case law summaries.
- Can iteratively refine their own outputs, which can hide the root cause of an error. For example, a content generation tool might produce a misleading statement in an early draft, then repeatedly refine and expand on it, amplifying the inaccuracy in the final version.
Evals help you detect and debug these issues before they impact your users. They can also help you decide how you might improve your application.
Choosing eval metrics
Because agentic systems can be extremely complex, there’s no one-size-fits-all set of metrics. Instead, we’ll explore an example agent and a potential set of scorers for each pattern defined in the guide to building effective agents. Metrics refer to the quantitative or qualitative measures used to evaluate how well the agent performs against its intended objectives, while scorers are the code-based or LLM-as-a-judge functions used to calculate these metrics. Like the guide, we’ll start simple and increase in complexity. Think of each set of scorers as a menu to pick and choose from depending on your specific system’s goals and constraints.
Quantitative vs. qualitative metrics
Quantitative metrics help track performance over time and compare different implementations. Some examples include accuracy against a test dataset, average cost per request, or average response latency. These metrics measure LLM behaviors with standardized checks. Scorers for these metrics are logical and deterministic.

Qualitative metrics measure whether the agent is achieving the desired experience for the end user. These usually come from user feedback or support tickets, and require some sort of human review or LLM-as-a-judge analysis. These qualitative checks capture nuances that raw metrics can miss. Scorers for these metrics can measure things like trends, "vibes", or lagging indicators.
In many cases, you’ll want both. For example, you might track how often your agent successfully handles a customer request and measure user satisfaction with the agent’s helpfulness.
Iterative evaluation process
Generally, a reasonable first pass at evaluating any agent is to start by looking at the final output. This gives a high-level view of the agent’s performance. If you notice failure points, you can go deeper on a particular intermediate step and add more evaluations, until you build a comprehensive eval system. You might find that you need more granular evals for things like:
- Behavioral patterns - step-by-step decisions
- Tool effectiveness - how well the agent uses specific tools or APIs
- Cost efficiency - monitoring resource usage over multiple iterations
Building block: the augmented LLM
- First, the LLM decides whether it needs data from an external knowledge base or a particular tool.
- If retrieval or a tool call is needed, we append the results to our prompt so the next LLM call can incorporate that new info.
- Finally, we ask the LLM to generate the “real” output for the user or system, updating our memory with the final result in case future steps rely on it.
For example, a user asks for a recipe for grilled chicken. The question is embedded, and the agent retrieves related recipes from NYT Cooking. The retrieved recipes are fed to the LLM as context, and the LLM generates a response for the user.
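To make this concrete, here's a minimal sketch of the flow in TypeScript, assuming the Anthropic SDK and a hypothetical searchRecipes helper that embeds the query and returns matching recipe text:

```typescript
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

// Hypothetical retrieval helper: embeds the query and returns matching recipe text.
declare function searchRecipes(query: string): Promise<string[]>;

export async function answerWithRetrieval(question: string): Promise<string> {
  // Step 1: retrieve external context related to the user's question.
  const recipes = await searchRecipes(question);

  // Step 2: append the retrieved context to the prompt so the LLM can use it.
  const response = await client.messages.create({
    model: "claude-3-5-sonnet-latest",
    max_tokens: 1024,
    messages: [
      {
        role: "user",
        content: `Answer the question using these recipes:\n\n${recipes.join(
          "\n\n"
        )}\n\nQuestion: ${question}`,
      },
    ],
  });

  // Step 3: return the final output for the user.
  const first = response.content[0];
  return first.type === "text" ? first.text : "";
}
```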
ContextInclusion - checks if the final LLM output string contains a required keyword or phrase, indicating that it successfully incorporated the retrieved context. Since the point of an augmented LLM is to use external information effectively, the augmentation step is wasted if the final output ignores it.
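A code-based version of this scorer can be just a few lines. The sketch below assumes the required keyword or phrase is stored in the expected field of your eval data:

```typescript
// A minimal code-based scorer: did the final output include the retrieved keyword?
// Assumes `expected` holds the keyword or phrase the output should contain.
function ContextInclusion({
  output,
  expected,
}: {
  output: string;
  expected: string;
}) {
  return {
    name: "ContextInclusion",
    score: output.toLowerCase().includes(expected.toLowerCase()) ? 1 : 0,
  };
}
```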
Factuality - uses the Factuality scorer from autoevals, either as-is or as a starting point for your own custom logic. Augmentation is supposed to reduce hallucinations by leveraging real data, so Factuality makes sure that the retrieval actually improved correctness.
The code snippet below also shows how to run an evaluation on Braintrust via the SDK using the Factuality scorer. For more information, check out the documentation.
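This is a rough sketch; the project name, dataset, and the answerWithRetrieval import (from the sketch above) are assumptions for illustration:

```typescript
import { Eval } from "braintrust";
import { Factuality } from "autoevals";

// `answerWithRetrieval` is the hypothetical augmented-LLM helper sketched earlier.
import { answerWithRetrieval } from "./agent";

Eval("Recipe agent", {
  // A small hand-written dataset of questions and reference answers.
  data: () => [
    {
      input: "How do I grill chicken without drying it out?",
      expected:
        "Brine the chicken, grill over medium heat, and let it rest before serving.",
    },
  ],
  task: async (input) => answerWithRetrieval(input),
  scores: [Factuality],
});
```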
RelevanceJudge - qualitatively judges whether the solution is relevant and uses the retrieved info correctly. Sometimes, an LLM might parrot keywords without actually understanding the user’s input.
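One way to build a judge like this is with LLMClassifierFromTemplate from autoevals. The prompt wording and choice labels below are illustrative assumptions you'd tune for your own data:

```typescript
import { LLMClassifierFromTemplate } from "autoevals";

// A hedged sketch of an LLM-as-a-judge scorer for relevance.
const RelevanceJudge = LLMClassifierFromTemplate({
  name: "RelevanceJudge",
  promptTemplate: `A user asked: {{input}}
The assistant answered using retrieved context: {{output}}

Is the answer relevant to the question, and does it actually use the retrieved
information rather than just repeating keywords?
a) Relevant and uses the retrieved information correctly
b) Partially relevant, or only superficially uses the retrieved information
c) Irrelevant or ignores the retrieved information`,
  choiceScores: { a: 1, b: 0.5, c: 0 },
  useCoT: true,
});
```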
Prompt chaining
A fixed series of LLM calls:
- Summarize
- Draft
- Refine
- And so on
For example, an agent first summarizes a user’s text, then writes a short draft, and finally refines it.
ExactMatch - compares the output to an “expected” output. If each chain step is very structured or requires exact text, a simple “exact match” or “string similarity” check at each step can be quite effective. You could also use ExactMatch from autoevals.
StepByStepAccuracy - instruments metadata, such as a boolean accuracy check for each step, then checks the accuracy of the final output. In prompt chaining, upstream errors cascade, so checking each step lets you pinpoint exactly where the chain fails.
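Here's a hedged sketch of what that could look like as a custom scorer, assuming your task logs a passed boolean per step under metadata.steps (the field names are assumptions):

```typescript
// Score the fraction of chain steps that passed their own check.
// Assumes the task logs metadata.steps = [{ name, passed }, ...].
function StepByStepAccuracy({
  metadata,
}: {
  metadata?: { steps?: { name: string; passed: boolean }[] };
}) {
  const steps = metadata?.steps ?? [];
  if (steps.length === 0) {
    // No steps logged; skip scoring rather than guessing.
    return { name: "StepByStepAccuracy", score: null };
  }
  const passed = steps.filter((step) => step.passed).length;
  return {
    name: "StepByStepAccuracy",
    score: passed / steps.length,
    // Surface which steps failed so you can debug the chain.
    metadata: { failedSteps: steps.filter((s) => !s.passed).map((s) => s.name) },
  };
}
```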
FlowCoherenceJudge - reads each step and decides if the chain logically flows and improves by the end. Some tasks require a more subjective measure; this is a good first step if you aren’t ready to incorporate human review.
Routing
- The system classifies user inputs into distinct routes
- Each route has its own specialized logic or sub-prompt
- If the agent picks the wrong route, the final output is likely incorrect
For example, an agent reads a customer request like “I want my money back” and decides whether to use the Refund flow or the General Inquiry flow.
RouteAccuracy - checks if the agent’s chosen route (found in the output) matches the expected route label. If the wrong route is chosen at the start, the final answer will be wrong.
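A code-based version of this check can be very small. The sketch below assumes the chosen route is the task output and the correct route label lives in expected:

```typescript
// A minimal routing check: does the chosen route match the expected label?
// Assumes `output` is the route the agent picked and `expected` is the correct route.
function RouteAccuracy({
  output,
  expected,
}: {
  output: string;
  expected: string;
}) {
  return {
    name: "RouteAccuracy",
    score: output.trim().toLowerCase() === expected.trim().toLowerCase() ? 1 : 0,
  };
}
```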
DownstreamTaskQuality - qualitatively checks whether, after picking a route, the final response actually solves the user’s problem. If you have a super nuanced set of routes, like multiple categories of customer support, an LLM might be better at determining accuracy than code.
Parallelization
Multiple LLM calls happen at once, either:
- Each doing a different section of a task
- Producing multiple candidate outputs and then “voting” on the best
The final step merges or selects from these parallel results.
For example, an agent splits a long text into two halves, processes each half with a separate LLM call, then merges them into a single summary.
MergeCoherenceJudge - uses an LLM to see if the combined sections produce a cohesive final text. Merging partial outputs can lead to repetition or clashing writing styles.
VotingConsensusCheck - if the system collects multiple candidates, checks how many of them match the final chosen answer. If the final answer was “voted in,” it’s important to measure how strongly the candidates agreed.
ParallelCostCheck - returns 1 if the total token usage for the parallel calls did not exceed a threshold. Parallelization can balloon your token usage, so this lets you keep an eye on cost-performance tradeoffs.
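As a sketch, assuming you log per-call token counts into metadata (the field name and budget below are assumptions to tune for your workload):

```typescript
// Flag runs where the parallel calls used more tokens than budgeted.
// Assumes the task logs metadata.parallelTokenCounts = [number, ...].
const TOKEN_BUDGET = 8000; // assumed threshold

function ParallelCostCheck({
  metadata,
}: {
  metadata?: { parallelTokenCounts?: number[] };
}) {
  const totalTokens = (metadata?.parallelTokenCounts ?? []).reduce(
    (sum, tokens) => sum + tokens,
    0
  );
  return {
    name: "ParallelCostCheck",
    score: totalTokens <= TOKEN_BUDGET ? 1 : 0,
    metadata: { totalTokens },
  };
}
```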
Orchestrator-workers
- An “orchestrator” LLM breaks a large or ambiguous request into subtasks
- “Worker” LLMs each handle a subtask, returning partial outputs
- The orchestrator merges them into a final result
For example, an agent breaks a coding request into multiple file edits, each file handled by a different worker, and then merges them into a single pull request.
SubtaskCoverage - checks if the final result includes all required subtasks (listed in expected.subtasks). If the orchestrator is supposed to handle ["Implement function A", "Implement function B"] but the final text doesn’t mention function B, the scorer fails.
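A sketch of this check, assuming expected.subtasks holds the list of required items and the merged result is the output string:

```typescript
// Check that every required subtask shows up in the merged result.
// Assumes expected.subtasks = ["Implement function A", "Implement function B", ...].
function SubtaskCoverage({
  output,
  expected,
}: {
  output: string;
  expected: { subtasks: string[] };
}) {
  if (expected.subtasks.length === 0) {
    return { name: "SubtaskCoverage", score: 1 };
  }
  const missing = expected.subtasks.filter(
    (subtask) => !output.toLowerCase().includes(subtask.toLowerCase())
  );
  return {
    name: "SubtaskCoverage",
    // Fraction of required subtasks that appear in the final output.
    score: 1 - missing.length / expected.subtasks.length,
    metadata: { missing },
  };
}
```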
PartialAccuracy - scores the correctness of each worker subtask, stored in metadata.workerOutputs. Each subtask might have a known correct snippet. This helps you pinpoint whether specific workers performed well.
FinalMergeCoherence - checks if the orchestrator’s final merge step produced a cohesive, non-redundant result. Even if each worker subtask is correct, the final step might accidentally produce a contradictory result.
Evaluator-optimizer
- One LLM (“optimizer”) attempts a solution
- Another LLM (“evaluator”) critiques it and provides feedback
- The optimizer refines the answer, repeating until it meets certain criteria or hits an iteration limit
For example, an agent tries to write a short poem. The evaluator LLM says “Needs more vivid imagery.” The agent modifies it, and so on.
ImprovementCheck - compares the initial draft (stored in metadata) with the final output, deciding if the final is “significantly improved.” If the final text is basically the same as the first draft, the evaluator’s feedback wasn’t used effectively.
IterationCount - logs how many times the loop ran (using iterations) and scores 1 if it’s less than or equal to some maximum. This makes sure the agent doesn’t iterate forever.
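One way to write this is as a parameterized scorer so the cap is easy to tune. The metadata.iterations field below is an assumption about what your task logs:

```typescript
// A parameterized scorer: pass only if the refinement loop stayed under a cap.
// Assumes the task logs the number of refinement rounds in metadata.iterations.
function IterationCount(maxIterations: number) {
  return function IterationCountScorer({
    metadata,
  }: {
    metadata?: { iterations?: number };
  }) {
    const iterations = metadata?.iterations ?? 0;
    return {
      name: "IterationCount",
      score: iterations <= maxIterations ? 1 : 0,
      metadata: { iterations },
    };
  };
}

// Usage: scores: [IterationCount(5)]
```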
FeedbackSpecificity - rates the evaluator’s feedback (stored in metadata) for clarity. For example, is it more than 20 characters, or does it mention specific improvements? If the evaluator always says “Looks good!”, the “optimization” loop won’t help.
Fully autonomous agent
- The agent decides each step on its own, calling tools, asking the user for more info, or finishing the task.
- It can run for many steps or until it hits a maximum iteration/cost guardrail.
For example, consider an agent that books travel. It calls an AirlineAPI tool to find flights, then a PaymentAPI tool to complete the booking. It continues working step-by-step until it completes the task or encounters a guardrail.
Autonomous agents are difficult to evaluate because they can contain any number of the above agentic systems within them. As you choose scorers for your autonomous agents, consider each step the agent takes and what scorers from earlier examples might be useful for that step.
StepLimitCheck - a simple scorer that ensures the agent’s step count is less than or equal to 5. Autonomy can lead to runaway loops.
ComplianceCheck - another LLM checks the final log or output for policy violations (harassment, disallowed content, etc.). Fully autonomous agents can easily do the wrong thing without direct guardrails.
TaskSuccessRate - checks whether the agent claims success (like “Booking confirmed”) or if metadata.successClaimed is true. At the end, we need to check if the agent actually finished the job.
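A sketch combining both signals (the success phrase and metadata field are assumptions based on the travel-booking example above):

```typescript
// Did the agent actually claim (and record) success?
// Assumes a success phrase in the output and/or a metadata.successClaimed flag.
function TaskSuccessRate({
  output,
  metadata,
}: {
  output: string;
  metadata?: { successClaimed?: boolean };
}) {
  const claimedInText = /booking confirmed/i.test(output);
  const claimedInMetadata = metadata?.successClaimed === true;
  return {
    name: "TaskSuccessRate",
    score: claimedInText || claimedInMetadata ? 1 : 0,
  };
}
```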
Best practices
Ultimately, choosing the right set of scorers will depend on the exact setup of your agentic system. In addition to the specific examples for the types of agents above, here’s some general guidance for how to choose the right evaluation metrics.
Code-based scorers are great for:
- Exact or binary conditions
  - Did the system pick the “customer support” route?
  - Did it stay under 5 steps?
- Numeric comparisons
  - Numeric difference from the expected output
- Structured or factual checks
  - Is the final code snippet error-free?
LLM-as-a-judge scorers are best when:
- You need subjective or contextual feedback
  - Did the agent output a coherent paragraph?
- Human-like interpretation is needed to decide if the agent responded politely or thoroughly
- You want to check improvement across multiple drafts
Autoevals are useful for:
- Basic correctness (Factuality)
- QA tasks (ClosedQA)
- Similarity checks (EmbeddingSimilarity)
Custom scorers let you:
- Incorporate domain-specific knowledge
  - Checking that a generated invoice meets certain business rules
- Evaluate multi-step flows
  - Partial checks, iteration loops
- Implement your own specialized logic
  - Analyzing a chain-of-thought or verifying references in a research doc
Over time, you can refine or replace scorers as you learn more about the real-world behaviors of your agent at scale.
Next steps
When you’re happy with your scorers, you can deploy them at scale on production logs by configuring online evaluation. Online evaluation runs your scoring functions asynchronously as you upload logs.
If your agentic system is extremely complex, you may want to incorporate human review. This can take the form of incorporating user feedback, or having your product team or subject matter experts manually evaluate your LLM outputs. You can use human review to evaluate/compare experiments, assess the efficacy of your automated scoring methods, and curate log events to use in your evals.