Evaluation lets you measure AI application performance systematically, turning non-deterministic outputs into an effective feedback loop. Run experiments to understand whether changes improve or regress quality, drill into specific examples, and avoid playing whack-a-mole with issues.

Why evaluate

In AI development, it’s hard to understand how updates impact performance. This breaks typical software workflows, making iteration feel like guesswork instead of engineering. Evaluations solve this by helping you:
  • Understand whether an update improves or regresses performance
  • Quickly drill down into good and bad examples
  • Diff specific examples versus prior runs
  • Catch regressions before they reach production
  • Build confidence in your changes

Offline vs. online evaluation

Braintrust supports two complementary modes of evaluation that work together to ensure quality throughout the development lifecycle.

Offline evaluation (experiments)

Run structured experiments during development to compare approaches systematically. Test changes against curated datasets before deployment, compare prompts or models side-by-side, and catch regressions in CI/CD. Offline evaluation helps you ship better changes by validating improvements before they reach production.

Online evaluation (production scoring)

Monitor production quality by scoring live requests automatically. Evaluate real user interactions at scale, catch regressions immediately, and identify new edge cases for offline testing. Online evaluation helps you maintain quality by continuously monitoring production behavior.
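Online scorers run against the traces your application logs, so the application side only needs to send production traffic to Braintrust. Here is a minimal sketch assuming an OpenAI-based app; the project name, model, and greet function are placeholders:
import { initLogger, wrapOpenAI } from "braintrust";
import OpenAI from "openai";

// Initialize logging for the project (name is a placeholder).
initLogger({ projectName: "Say Hi Bot" });

// Wrapping the client automatically traces each LLM call.
const client = wrapOpenAI(new OpenAI());

export async function greet(userInput: string): Promise<string> {
  const response = await client.chat.completions.create({
    model: "gpt-4o-mini", // assumed model
    messages: [{ role: "user", content: `Say hi to ${userInput}` }],
  });
  return response.choices[0].message.content ?? "";
}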

Continuous feedback loop

Both modes use the same scorer library, enabling a continuous improvement cycle:
  1. Develop and test with offline experiments.
  2. Deploy changes with confidence.
  3. Monitor production with online scoring.
  4. Feed production insights back into datasets for offline testing.
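Step 4 can also be done programmatically. The sketch below assumes the TypeScript SDK's initDataset; the project name, dataset name, and record values are placeholders:
import { initDataset } from "braintrust";

// Project and dataset names are placeholders.
const dataset = initDataset("Say Hi Bot", { dataset: "Production edge cases" });

// Records mirror the Eval data shape: input, optional expected output, metadata.
dataset.insert({
  input: "Foo Bar",
  expected: "Hi Foo Bar",
  metadata: { source: "online-scoring", onlineScore: 0.2 },
});

await dataset.flush(); // ensure the record is uploaded before the process exits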

Anatomy of an evaluation

Every evaluation consists of three parts:

Data

A dataset of test cases containing inputs, expected outputs (optional), and metadata. Build datasets from production logs, user feedback, or manual curation. Learn about datasets

Task

The AI function you want to test: any function that takes an input and returns an output. This is typically an LLM call, but it can be any logic you want to evaluate.
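For example, a task might wrap a chat completion call. The sketch below assumes an OpenAI client; the model, prompt, and function name are placeholders:
import OpenAI from "openai";

const client = new OpenAI();

// A task receives the dataset's `input` and returns the output to be scored.
async function greetTask(input: string): Promise<string> {
  const completion = await client.chat.completions.create({
    model: "gpt-4o-mini", // assumed model
    messages: [{ role: "user", content: `Greet the user named ${input}.` }],
  });
  return completion.choices[0].message.content ?? "";
}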

Scores

Scoring functions that measure quality by comparing inputs, outputs, and expected values. Use automated scorers like factuality or similarity, LLM-as-a-judge scorers, or custom code-based logic. Learn about scorers
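For instance, a custom code-based scorer is just a function over the input, output, and expected value that returns a score between 0 and 1. A minimal sketch (the name and matching logic are illustrative):
// A custom scorer: exact match against the expected value.
function exactMatch({
  output,
  expected,
}: {
  input: string;
  output: string;
  expected?: string;
}) {
  return {
    name: "ExactMatch",
    score: output.trim() === expected?.trim() ? 1 : 0,
  };
}

// Use it alongside library scorers, e.g. scores: [Factuality, exactMatch]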

Run evaluations

Use the Eval() function to run experiments:
import { Eval } from "braintrust";
import { Factuality } from "autoevals";

Eval("Say Hi Bot", {
  data: () => [
    { input: "Foo", expected: "Hi Foo" },
    { input: "Bar", expected: "Hello Bar" },
  ],
  task: async (input) => {
    return "Hi " + input; // Replace with your LLM call
  },
  scores: [Factuality],
});
Running your eval automatically creates an experiment, displays a summary in your terminal, and populates the UI. Learn how to run evaluations

Interpret results

The experiment view shows:
  • Summary metrics for all scores
  • Table of test cases with individual scores
  • Detailed traces for each example
  • Comparisons to baseline experiments
  • Improvements and regressions highlighted
Filter by high or low scores, sort by changes, and drill into specific examples to understand behavior. Learn how to interpret results

Compare experiments

Run multiple experiments to compare approaches:
  • Different prompts or models
  • Various parameter configurations
  • Alternative architectures or flows
  • Before and after code changes
Braintrust highlights score differences and shows which test cases improved or regressed. Learn how to compare experiments
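For example, running one Eval per variant in the same project produces experiments you can diff in the UI. A sketch with two placeholder prompt variants:
import { Eval } from "braintrust";
import { Factuality } from "autoevals";

const data = () => [
  { input: "Foo", expected: "Hi Foo" },
  { input: "Bar", expected: "Hello Bar" },
];

// Variant A: terse greeting
Eval("Say Hi Bot", {
  data,
  task: async (input) => `Hi ${input}`,
  scores: [Factuality],
});

// Variant B: friendlier greeting
Eval("Say Hi Bot", {
  data,
  task: async (input) => `Hello ${input}, great to see you!`,
  scores: [Factuality],
});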

Use playgrounds

Playgrounds provide a no-code environment for rapid experimentation:
  • Test prompts and models interactively
  • Run evaluations on datasets without code
  • Compare results side-by-side
  • Share configurations with teammates
Use playgrounds for quick iteration, then codify winning approaches in your application. Learn how to use playgrounds

Write effective components

Create high-quality evaluation components:
  • Prompts: Clear instructions that guide model behavior
  • Scorers: Reliable functions that measure what matters
  • Datasets: Representative examples covering edge cases
Use Loop to generate and optimize these components based on your production data.
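For LLM-as-a-judge scorers, autoevals provides LLMClassifierFromTemplate. A minimal sketch assuming that helper, with an illustrative name and rubric:
import { LLMClassifierFromTemplate } from "autoevals";

// An illustrative judge that grades how friendly a greeting is.
const friendliness = LLMClassifierFromTemplate({
  name: "Friendliness",
  promptTemplate: `Rate the friendliness of this greeting:

{{output}}

a) warm and friendly
b) neutral
c) curt or rude`,
  choiceScores: { a: 1, b: 0.5, c: 0 },
  useCoT: true,
});

// Add it to an Eval's scores array like any other scorer, e.g. scores: [friendliness]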

Next steps