
Why evaluate
In AI development, it’s hard to understand how updates impact performance. This breaks typical software workflows, making iteration feel like guesswork instead of engineering. Evaluations solve this by helping you:
- Understand whether an update improves or regresses performance
- Quickly drill down into good and bad examples
- Diff specific examples versus prior runs
- Catch regressions before they reach production
- Build confidence in your changes
Offline vs. online evaluation
Braintrust supports two complementary modes of evaluation that work together to ensure quality throughout the development lifecycle.
Offline evaluation (experiments)
Run structured experiments during development to compare approaches systematically. Test changes against curated datasets before deployment, compare prompts or models side-by-side, and catch regressions in CI/CD. Offline evaluation helps you ship better changes by validating improvements before they reach production. Key workflows:
- Run evaluations with the Eval() function
- Interpret results to find improvements and regressions
- Compare experiments to measure impact
- Use playgrounds for rapid iteration
Online evaluation (production scoring)
Monitor production quality by scoring live requests automatically. Evaluate real user interactions at scale, catch regressions immediately, and identify new edge cases for offline testing. Online evaluation helps you maintain quality by continuously monitoring production behavior. Key workflows:
- Score production traces with automatic scoring rules (see the logging sketch after this list)
- Monitor with dashboards to track metrics over time
- Deploy with monitoring to catch issues in production
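The sketch below shows roughly what those scoring rules operate on. It assumes the Braintrust TypeScript SDK's initLogger and logger.log, plus a hypothetical project name and a stubbed LLM call; each logged interaction becomes a trace that automatic scoring rules can then evaluate.

```typescript
import { initLogger } from "braintrust";

// Assumes BRAINTRUST_API_KEY is set in the environment; "my-app" is a
// hypothetical project name.
const logger = initLogger({ projectName: "my-app" });

export async function handleRequest(userInput: string): Promise<string> {
  // Call your model or application logic here (stubbed for this sketch).
  const output = await myLLMCall(userInput);

  // Log the interaction so online scoring rules can evaluate it.
  logger.log({
    input: userInput,
    output,
    metadata: { route: "chat", env: "production" },
  });

  return output;
}

// Hypothetical stand-in for your actual LLM call.
async function myLLMCall(input: string): Promise<string> {
  return `echo: ${input}`;
}
```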
Continuous feedback loop
Both modes use the same scorer library, enabling a continuous improvement cycle:
- Develop and test with offline experiments.
- Deploy changes with confidence.
- Monitor production with online scoring.
- Feed production insights back into datasets for offline testing (a sketch follows this list).
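A minimal sketch of that last step, assuming the Braintrust TypeScript SDK's initDataset, insert, and flush; the project name, dataset name, and record values are hypothetical.

```typescript
import { initDataset } from "braintrust";

// Hypothetical project and dataset names.
const dataset = initDataset("my-app", { dataset: "production-edge-cases" });

// A production interaction worth adding to offline tests (hypothetical values).
dataset.insert({
  input: "How do I reset my password?",
  expected: "Walk the user through the password reset flow.",
  metadata: { source: "production-log", loggedAt: "2024-06-01" },
});

// Write any buffered records before the process exits (assumes flush()).
await dataset.flush();
```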
Anatomy of an evaluation
Every evaluation consists of three parts:
Data
A dataset of test cases containing inputs, expected outputs (optional), and metadata. Build datasets from production logs, user feedback, or manual curation. → Learn about datasets
Task
The AI function you want to test - any function that takes an input and returns an output. This is typically an LLM call, but can be any logic you want to evaluate.
Scores
Scoring functions that measure quality by comparing inputs, outputs, and expected values. Use automated scorers like factuality or similarity, LLM-as-a-judge scorers, or custom code-based logic (a minimal custom scorer sketch follows). → Learn about scorers
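For the custom case, here is a minimal sketch of a code-based scorer. It assumes the common Braintrust convention that a scorer receives the input, output, and expected value and returns a named score between 0 and 1; exactMatch itself is a hypothetical example.

```typescript
// Hypothetical custom scorer: returns 1 when the output matches the expected
// value exactly (ignoring surrounding whitespace), else 0.
function exactMatch({
  output,
  expected,
}: {
  input: string;
  output: string;
  expected?: string;
}) {
  return {
    name: "exact_match",
    score: expected !== undefined && output.trim() === expected.trim() ? 1 : 0,
  };
}
```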
Run evaluations
Use the Eval() function to run experiments:
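A minimal sketch, assuming the Braintrust TypeScript SDK's Eval() and the Levenshtein scorer from autoevals; the project name, data, and task are hypothetical placeholders.

```typescript
import { Eval } from "braintrust";
import { Levenshtein } from "autoevals";

Eval("Say Hi Bot", {
  // Data: test cases with inputs and (optional) expected outputs.
  data: () => [
    { input: "Foo", expected: "Hi Foo" },
    { input: "Bar", expected: "Hi Bar" },
  ],
  // Task: the AI function under test (an LLM call in a real app).
  task: async (input: string) => {
    return "Hi " + input;
  },
  // Scores: library scorers and/or custom scorers like exactMatch above.
  scores: [Levenshtein],
});
```

Each run records an experiment that you can inspect and compare in the Braintrust UI.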
Interpret results
The experiment view shows:
- Summary metrics for all scores
- Table of test cases with individual scores
- Detailed traces for each example
- Comparisons to baseline experiments
- Improvements and regressions highlighted
Compare experiments
Run multiple experiments to compare approaches (see the sketch after this list):
- Different prompts or models
- Various parameter configurations
- Alternative architectures or flows
- Before and after code changes
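As a rough sketch of a model comparison, assuming Eval() accepts experimentName and metadata options; the model names and task are hypothetical.

```typescript
import { Eval } from "braintrust";
import { Levenshtein } from "autoevals";

const data = () => [{ input: "Foo", expected: "Hi Foo" }];

// Run one experiment per model configuration, then compare them in the UI.
for (const model of ["model-a", "model-b"]) {
  Eval("Say Hi Bot", {
    experimentName: `greeting-${model}`, // assumed option for naming runs
    metadata: { model },                 // assumed option for tagging runs
    data,
    // Hypothetical task: call the chosen model here (stubbed for the sketch).
    task: async (input: string) => `(${model}) Hi ${input}`,
    scores: [Levenshtein],
  });
}
```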
Use playgrounds
Playgrounds provide a no-code environment for rapid experimentation:
- Test prompts and models interactively
- Run evaluations on datasets without code
- Compare results side-by-side
- Share configurations with teammates
Write effective components
Create high-quality evaluation components:
- Prompts: Clear instructions that guide model behavior
- Scorers: Reliable functions that measure what matters
- Datasets: Representative examples covering edge cases
Next steps
- Run evaluations with the Eval function
- Write prompts that guide model behavior
- Write scorers to measure quality
- Use playgrounds for rapid iteration