Evaluation lets you measure AI application performance systematically, turning non-deterministic outputs into an effective feedback loop. Run experiments to understand whether changes improve or regress quality, drill into specific examples, and avoid playing whack-a-mole with issues.

Why evaluate

In AI development, it’s hard to understand how updates impact performance. This breaks typical software workflows, making iteration feel like guesswork instead of engineering. Evaluations solve this by helping you:
  • Understand whether an update improves or regresses performance
  • Quickly drill down into good and bad examples
  • Diff specific examples versus prior runs
  • Catch regressions before they reach production
  • Build confidence in your changes

Offline vs. online evaluation

Braintrust supports two complementary modes of evaluation that work together to ensure quality throughout the development lifecycle.

Offline evaluation (experiments)

Run structured experiments during development to compare approaches systematically. Test changes against curated datasets before deployment, compare prompts or models side-by-side, and catch regressions in CI/CD. Offline evaluation helps you ship better changes by validating improvements before they reach production.

Online evaluation (production scoring)

Monitor production quality by scoring live requests automatically. Evaluate real user interactions at scale, catch regressions immediately, and identify new edge cases for offline testing. Online evaluation helps you maintain quality by continuously monitoring production behavior.
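Online scorers run against the traces your application logs, so the application side only needs to send production traffic to Braintrust. Here is a minimal sketch assuming an OpenAI-based app; the project name, model, and greet function are placeholders:
import { initLogger, wrapOpenAI } from "braintrust";
import OpenAI from "openai";

// Initialize logging for the project (name is a placeholder).
initLogger({ projectName: "Say Hi Bot" });

// Wrapping the client automatically traces each LLM call.
const client = wrapOpenAI(new OpenAI());

export async function greet(userInput: string): Promise<string> {
  const response = await client.chat.completions.create({
    model: "gpt-4o-mini", // assumed model
    messages: [{ role: "user", content: `Say hi to ${userInput}` }],
  });
  return response.choices[0].message.content ?? "";
}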

Continuous feedback loop

Both modes use the same scorer library, enabling a continuous improvement cycle:
  1. Develop and test with offline experiments.
  2. Deploy changes with confidence.
  3. Monitor production with online scoring.
  4. Feed production insights back into datasets for offline testing.
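Step 4 can also be done programmatically. The sketch below assumes the TypeScript SDK's initDataset; the project name, dataset name, and record values are placeholders:
import { initDataset } from "braintrust";

// Project and dataset names are placeholders.
const dataset = initDataset("Say Hi Bot", { dataset: "Production edge cases" });

// Records mirror the Eval data shape: input, optional expected output, metadata.
dataset.insert({
  input: "Foo Bar",
  expected: "Hi Foo Bar",
  metadata: { source: "online-scoring", onlineScore: 0.2 },
});

await dataset.flush(); // ensure the record is uploaded before the process exits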

Anatomy of an evaluation

Every evaluation consists of three parts:

Data

A dataset of test cases containing inputs, expected outputs (optional), and metadata. Build datasets from production logs, user feedback, or manual curation. Learn about datasets

Task

The AI function you want to test: any function that takes an input and returns an output. This is typically an LLM call, but it can be any logic you want to evaluate.
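For example, a task might wrap a chat completion call. The sketch below assumes an OpenAI client; the model, prompt, and function name are placeholders:
import OpenAI from "openai";

const client = new OpenAI();

// A task receives the dataset's `input` and returns the output to be scored.
async function greetTask(input: string): Promise<string> {
  const completion = await client.chat.completions.create({
    model: "gpt-4o-mini", // assumed model
    messages: [{ role: "user", content: `Greet the user named ${input}.` }],
  });
  return completion.choices[0].message.content ?? "";
}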

Scores

Scoring functions that measure quality by comparing inputs, outputs, and expected values. Use automated scorers like factuality or similarity, LLM-as-a-judge scorers, or custom code-based logic. Learn about scorers
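For instance, a custom code-based scorer is just a function over the input, output, and expected value that returns a score between 0 and 1. A minimal sketch (the name and matching logic are illustrative):
// A custom scorer: exact match against the expected value.
function exactMatch({
  output,
  expected,
}: {
  input: string;
  output: string;
  expected?: string;
}) {
  return {
    name: "ExactMatch",
    score: output.trim() === expected?.trim() ? 1 : 0,
  };
}

// Use it alongside library scorers, e.g. scores: [Factuality, exactMatch]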

Run evaluations

Use the Eval() function to run experiments:
import { Eval } from "braintrust";
import { Factuality } from "autoevals";

Eval("Say Hi Bot", {
  data: () => [
    { input: "Foo", expected: "Hi Foo" },
    { input: "Bar", expected: "Hello Bar" },
  ],
  task: async (input) => {
    return "Hi " + input; // Replace with your LLM call
  },
  scores: [Factuality],
});
Running your eval automatically creates an experiment, displays a summary in your terminal, and populates the UI. Learn how to run evaluations

Interpret results

The experiment view shows:
  • Summary metrics for all scores
  • Table of test cases with individual scores
  • Detailed traces for each example
  • Comparisons to baseline experiments
  • Improvements and regressions highlighted
Filter by high or low scores, sort by changes, and drill into specific examples to understand behavior. Learn how to interpret results

Compare experiments

Run multiple experiments to compare approaches:
  • Different prompts or models
  • Various parameter configurations
  • Alternative architectures or flows
  • Before and after code changes
Braintrust highlights score differences and shows which test cases improved or regressed. Learn how to compare experiments
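For example, running one Eval per variant in the same project produces experiments you can diff in the UI. A sketch with two placeholder prompt variants:
import { Eval } from "braintrust";
import { Factuality } from "autoevals";

const data = () => [
  { input: "Foo", expected: "Hi Foo" },
  { input: "Bar", expected: "Hello Bar" },
];

// Variant A: terse greeting
Eval("Say Hi Bot", {
  data,
  task: async (input) => `Hi ${input}`,
  scores: [Factuality],
});

// Variant B: friendlier greeting
Eval("Say Hi Bot", {
  data,
  task: async (input) => `Hello ${input}, great to see you!`,
  scores: [Factuality],
});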

Use playgrounds

Playgrounds provide a no-code environment for rapid experimentation:
  • Test prompts and models interactively
  • Run evaluations on datasets without code
  • Compare results side-by-side
  • Share configurations with teammates
Use playgrounds for quick iteration, then codify winning approaches in your application. Learn how to use playgrounds

Write effective components

Create high-quality evaluation components:
  • Prompts: Clear instructions that guide model behavior
  • Scorers: Reliable functions that measure what matters
  • Datasets: Representative examples covering edge cases
Use Loop to generate and optimize these components based on your production data.
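For LLM-as-a-judge scorers, autoevals provides LLMClassifierFromTemplate. A minimal sketch assuming that helper, with an illustrative name and rubric:
import { LLMClassifierFromTemplate } from "autoevals";

// An illustrative judge that grades how friendly a greeting is.
const friendliness = LLMClassifierFromTemplate({
  name: "Friendliness",
  promptTemplate: `Rate the friendliness of this greeting:

{{output}}

a) warm and friendly
b) neutral
c) curt or rude`,
  choiceScores: { a: 1, b: 0.5, c: 0 },
  useCoT: true,
});

// Add it to an Eval's scores array like any other scorer, e.g. scores: [friendliness]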

Next steps