Ship LLM products that work.

Braintrust is the end-to-end platform for building world-class AI apps.

Evaluate your prompts and models

Non-deterministic models and unpredictable natural language inputs make building robust LLM applications difficult. Adapt your development lifecycle for the AI era with Braintrust's iterative LLM workflows.

Easily answer questions like “which examples regressed when we changed the prompt?” or “what happens if I try this new model?”

| Eval | Score | Provider | Model | Tokens | Latency | Est. cost |
| --- | --- | --- | --- | --- | --- | --- |
| Translation | 51 | OpenAI | o1-mini | 1,539 | 8.07s | ~$0.018 |
| Levenshtein distance | 67 | Mistral | Mistral Nemo | 1,021 | 4.83s | ~$0.002 |
| Similarity | 89 | OpenAI | GPT-4o | 2,992 | 12.44s | ~$0.010 |
| Moderation | 60 | Anthropic | Claude 3.5 Sonnet | 1,958 | 11.24s | ~$0.008 |
| Security | 54 | Google | Gemini Pro | 2,610 | 9.23s | ~$0.008 |
| Hallucination | 33 | Meta | Llama 3.5 | 1,620 | 10.2s | ~$0.014 |
| Summary | 29 | Perplexity | Sonar Large Online | 1,004 | 12.2s | ~$0.004 |

Anatomy of an eval

Braintrust evals are composed of three components—a prompt, scorers, and a dataset of examples.
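
In code, an eval wires these three pieces together. Here is a minimal sketch using the TypeScript SDK; the project name, data, and task below are illustrative placeholders, not a real app:

```typescript
import { Eval } from "braintrust";
import { Levenshtein } from "autoevals";

// "Translation demo", the data, and the task are placeholders.
Eval("Translation demo", {
  // Dataset: examples pairing inputs with expected outputs
  data: () => [
    { input: "Bonjour le monde", expected: "Hello world" },
    { input: "Bonne nuit", expected: "Good night" },
  ],
  // Prompt/task: the function under test (normally an LLM call; stubbed here)
  task: async (input: string) => `Hello ${input}`,
  // Scorers: built-in autoevals or your own code
  scores: [Levenshtein],
});
```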

Prompt

Tweak LLM prompts from any AI provider, run them, and track their performance over time. Seamlessly and securely sync your prompts with your code.

Prompts guide
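
For example, a prompt managed in Braintrust can be pulled into application code at runtime. This sketch assumes the TypeScript SDK's prompt-loading API; the project name and slug are placeholders:

```typescript
import OpenAI from "openai";
import { loadPrompt, wrapOpenAI } from "braintrust";

// Load the current version of a prompt managed in Braintrust.
// "Translation demo" and "translate-v1" are placeholder names.
const prompt = await loadPrompt({
  projectName: "Translation demo",
  slug: "translate-v1",
});

// Build provider-ready parameters from the prompt and run it.
const client = wrapOpenAI(new OpenAI());
const completion = await client.chat.completions.create(
  prompt.build({ text: "Bonjour le monde" })
);
```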
Scorers

Use industry-standard autoevals or write your own in code or natural language. A scorer takes an input, the LLM output, and an expected value, and produces a score (see the sketch below).

Scorers guide
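
A custom code scorer is just a function. This sketch follows the autoevals convention of returning a name and a 0-to-1 score; the name and matching logic are illustrative:

```typescript
// Receives the input, the LLM output, and the expected value; returns a score.
function exactMatch({
  output,
  expected,
}: {
  input: string;
  output: string;
  expected?: string;
}) {
  return {
    name: "ExactMatch",
    score: output.trim() === expected?.trim() ? 1 : 0,
  };
}
```

Pass it to an eval alongside the built-ins, e.g. `scores: [Levenshtein, exactMatch]`.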
Dataset

Capture rated examples from staging and production and incorporate them into “golden” datasets. Datasets are integrated, versioned, scalable, and secure.

Datasets guide
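
As a sketch with the TypeScript SDK, captured examples can be appended to a dataset from code; the project, dataset, and metadata names below are placeholders:

```typescript
import { initDataset } from "braintrust";

// Open (or create) a dataset in a project. Names are placeholders.
const dataset = initDataset("Translation demo", { dataset: "golden-examples" });

// Append a rated example captured from staging or production.
dataset.insert({
  input: "Bonjour le monde",
  expected: "Hello world",
  metadata: { source: "production", rating: "thumbs_up" },
});

await dataset.flush();
```

A dataset opened this way can also be passed as the `data` argument of an `Eval`.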

Features for everyone

Intuitively designed for both technical and non-technical team members, and synced between code and UI.

Traces

Visualize and analyze LLM execution traces in real-time to debug and optimize your AI apps.

Tracing guide
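
As a minimal sketch with the TypeScript SDK, wrapping an OpenAI client is enough to capture each call as a trace; the project name is a placeholder:

```typescript
import OpenAI from "openai";
import { initLogger, wrapOpenAI } from "braintrust";

// Point logs at a Braintrust project; the name is a placeholder.
initLogger({ projectName: "Translation demo" });

// wrapOpenAI instruments the client so each request/response is logged as a span.
const client = wrapOpenAI(new OpenAI());

const completion = await client.chat.completions.create({
  model: "gpt-4o",
  messages: [{ role: "user", content: "Translate 'Bonjour le monde' to English." }],
});
```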
Monitoring

Monitor real-world AI interactions with insights to ensure your models perform optimally in production.

Logging and monitoring
Online evaluations

Continuously evaluate with automatic, asynchronous server-side scoring as you upload logs.

Online evaluation docs
Functions

Define functions in TypeScript or Python and use them as custom scorers or callable tools.

Functions reference
Self-hosting

Deploy and run Braintrust on your own infrastructure for full control over your data and compliance requirements.

Self-hosting guide

Join industry leaders

“Braintrust fills the missing (and critical!) gap of evaluating non-deterministic AI systems.”
Mike Knoop
Cofounder/Head of AI
“I’ve never seen a workflow transformation like the one that incorporates evals into ‘mainstream engineering’ processes before. It’s astonishing.”
Malte Ubl
CTO
“Braintrust finally brings end-to-end testing to AI products, helping companies produce meaningful quality metrics.”
Michele Catasta
President
“We log everything to Braintrust. They make it very easy to find and fix issues.”
Simon Last
Cofounder
“Every new AI project starts with evals in Braintrust—it’s a game changer.”
Lee Weisberger
Eng. Manager, AI