Visualize and interpret eval results

View results in the UI

Running an eval from the API or SDK will return a link to the corresponding results in Braintrust's UI. When you open the link, you'll land on a detailed view of the eval run that you selected. The detailed view includes:

Diff mode toggle - Allows you to compare eval runs to each other. If you click the toggle, you will see the results of your current eval compared to the results of the baseline.
Filter bar - Allows you to focus on a subset of test cases. You can filter by typing natural language or BTQL.
Column visibility - Allows you to show or hide columns. You can also order columns by regressions to home in on problematic areas.
Table - Shows the data for every test case in your eval run.

Experiment summaries

When you select an experiment, you'll get a summary of the comparisons, scorers, datasets, and metadata.

You can also view and copy the experiment ID from the bottom of the summary pane.

Table header summaries

Summaries will appear for score and metric columns. To find test cases to focus on, use column header summaries to filter by improvements or regressions (test cases that decreased in score). This allows you to see the scorers with the biggest issues. You can also group the table to view summaries across metadata fields or inputs. For example, if you use separate datasets for distinct types of use cases, you can group by dataset to see which use cases are having the biggest issues.

Group summaries

By default, group rows will show one experiment's summary data, and you can switch between experiments by selecting your desired aggregation.

If you would like to view the summary data for all experiments, select Include comparisons in group.

Within a grouped table, you can also sort rows by regressions of a specific score relative to a comparison experiment.

Now that you've narrowed your test cases, you can view a test case in detail by selecting a row.

Trace view

Selecting a row will open the trace view. Here you can see all of the data for the trace for this test case, including input, output, metadata, and metrics for each span inside the trace.

Look at the scores and the output and decide whether the scores seem "right". Do good scores correspond to a good output? If not, you'll want to improve your evals by updating scorers or test cases.

Create custom columns

You can create custom columns to extract specific values from input, output, expected, or metadata fields if they are objects. To do this, use the Add custom column option at the bottom of the Columns dropdown or select the + icon at the end of the table headers.

After naming your custom column, you can either choose from the inferred fields in the dropdown or enter a custom BTQL statement.

Once created, you can filter and sort the table using your custom columns.

Interpreting results

How metrics are calculated

Along with the scores you track, Braintrust tracks a number of metrics about your LLM calls that help you assess and understand performance. For example, if you're trying to figure out why the average duration increased substantially after you changed a model, it's useful to look at both duration and token metrics to diagnose the underlying issue.

Wherever possible, metrics are computed on the task subspan, so that LLM-as-a-judge calls are excluded. Specifically:

Duration is the duration of the "task" span.
Offset is the time elapsed since the trace start time.
Prompt tokens, Completion tokens, Total tokens, LLM duration, and Estimated LLM cost are averaged over every span that is not marked with span_attributes.purpose = "scorer", which is set automatically in autoevals.
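
To make the averaging rule concrete, here is a minimal sketch of how an LLM metric could be averaged while skipping scorer spans. The Span shape, the averageMetric helper, and the choice to ignore spans that report no value are assumptions for illustration; this is not Braintrust's implementation.

```typescript
// Hypothetical span shape for illustration; real spans carry more fields.
interface Span {
  name: string;
  span_attributes?: { purpose?: string };
  metrics?: { prompt_tokens?: number; completion_tokens?: number };
}

// Average a metric over every span that is not marked as a scorer span.
// Assumption: spans that do not report the metric are left out of the average.
function averageMetric(
  spans: Span[],
  metric: "prompt_tokens" | "completion_tokens",
): number | undefined {
  const values = spans
    .filter((s) => s.span_attributes?.purpose !== "scorer")
    .map((s) => s.metrics?.[metric])
    .filter((v): v is number => typeof v === "number");
  if (values.length === 0) return undefined;
  return values.reduce((a, b) => a + b, 0) / values.length;
}

const exampleSpans: Span[] = [
  { name: "draft", metrics: { prompt_tokens: 100, completion_tokens: 20 } },
  { name: "refine", metrics: { prompt_tokens: 200, completion_tokens: 40 } },
  {
    // An LLM-as-a-judge call: excluded because of its purpose attribute.
    name: "judge",
    span_attributes: { purpose: "scorer" },
    metrics: { prompt_tokens: 900, completion_tokens: 10 },
  },
];

console.log(averageMetric(exampleSpans, "prompt_tokens")); // 150
```

The judge span's 900 prompt tokens do not affect the result, which is why marking scorer spans correctly matters when you log traces yourself.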

If you are using the logging SDK or the API, you will need to follow these conventions to ensure that metrics are computed correctly.

To compute LLM metrics (like token counts), make sure you wrap your LLM calls.

Diff mode

When you run multiple experiments, Braintrust will automatically compare the results of experiments to each other. This allows you to quickly see which test cases improved or regressed across experiments.

You can also select any individual row in an experiment to see diffs for each field in a span.

How rows are matched

By default, Braintrust considers two test cases to be the same if they have the same input field. This is used both to match test cases across experiments and to bucket equivalent cases together in a trial.
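
As a rough illustration of input-based matching, the sketch below pairs rows from a base experiment with rows from a comparison experiment. The Row shape, the matchRows helper, and the use of JSON.stringify as the equality test are assumptions; Braintrust's actual matching logic may differ.

```typescript
// Hypothetical row shape for illustration.
interface Row {
  input: unknown;
  scores: Record<string, number>;
}

interface MatchedPair {
  key: string;
  base: Row;
  comparison: Row;
}

// Assumption: two inputs are "the same" when their JSON serializations are
// identical. Note that JSON.stringify is sensitive to object key order.
function comparisonKey(row: Row): string {
  return JSON.stringify(row.input);
}

function matchRows(base: Row[], comparison: Row[]): MatchedPair[] {
  // Index comparison rows by key, keeping the first row seen for each key.
  const index: Record<string, Row> = {};
  for (const row of comparison) {
    const key = comparisonKey(row);
    if (!(key in index)) index[key] = row;
  }
  // Pair each base row with the comparison row that shares its key, if any.
  const pairs: MatchedPair[] = [];
  for (const row of base) {
    const key = comparisonKey(row);
    const match: Row | undefined = index[key];
    if (match !== undefined) pairs.push({ key, base: row, comparison: match });
  }
  return pairs;
}

const baseRows: Row[] = [
  { input: { q: "a" }, scores: { accuracy: 0.5 } },
  { input: { q: "b" }, scores: { accuracy: 1 } },
];
const comparisonRows: Row[] = [
  { input: { q: "a" }, scores: { accuracy: 0.75 } },
  { input: { q: "c" }, scores: { accuracy: 0.25 } },
];

const matched = matchRows(baseRows, comparisonRows);
console.log(matched.length); // 1
```

Only the row with input { q: "a" } appears in both experiments, so it is the only one that can be diffed.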

Viewing data across trials

To group by trials, or multiple rows with the same input value, select Input from the Group dropdown menu. This will consolidate each trial for a given input and display aggregate data, showing comparisons for each unique input across all experiments.

If Braintrust detects that any rows have the same input value within the same experiment, diff mode will show a Trials column where you can select matching trials in your comparison experiments. You can also step through the relevant trial rows in your comparison experiment by selecting a specific trace.
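
Grouping by input can be pictured as bucketing rows that share an input and aggregating their scores. The sketch below assumes string inputs and a simple mean; TrialRow and groupTrials are hypothetical names, and the aggregation Braintrust displays may be different.

```typescript
// Hypothetical row shape: each trial of the same test case repeats the input.
interface TrialRow {
  input: string;
  scores: Record<string, number>;
}

interface TrialSummary {
  trials: number;
  mean: number;
}

// Bucket rows by input and compute the mean of one score per bucket.
function groupTrials(
  rows: TrialRow[],
  scoreName: string,
): Record<string, TrialSummary> {
  const totals: Record<string, { trials: number; total: number }> = {};
  for (const row of rows) {
    const score: number | undefined = row.scores[scoreName];
    if (score === undefined) continue; // skip rows without this score
    const entry = totals[row.input] ?? { trials: 0, total: 0 };
    entry.trials += 1;
    entry.total += score;
    totals[row.input] = entry;
  }
  const summary: Record<string, TrialSummary> = {};
  for (const input of Object.keys(totals)) {
    const { trials, total } = totals[input];
    summary[input] = { trials, mean: total / trials };
  }
  return summary;
}

const trialRows: TrialRow[] = [
  { input: "hi", scores: { accuracy: 1 } },
  { input: "hi", scores: { accuracy: 0.5 } },
  { input: "bye", scores: { accuracy: 0 } },
];

const trialSummary = groupTrials(trialRows, "accuracy");
console.log(trialSummary["hi"]); // { trials: 2, mean: 0.75 }
```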

Customizing the comparison key

Sometimes your input includes additional data, and you need to use a different expression to match test cases. You can configure the comparison key on your project's Configuration page.
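
To see why a custom key can help, consider inputs that carry a per-run identifier. In the sketch below, the question and requestId fields and both key functions are hypothetical; they only illustrate the effect of matching on part of the input instead of the whole value.

```typescript
// Hypothetical input shape: the question is stable, the request ID is not.
type ExampleInput = { question: string; requestId: string };

// Matching on the whole input, serialized as JSON.
const wholeInputKey = (input: ExampleInput): string => JSON.stringify(input);

// Matching on the stable field only, similar in spirit to an expression
// such as input.question.
const questionKey = (input: ExampleInput): string => input.question;

const firstRun: ExampleInput = { question: "What is 2+2?", requestId: "run-1" };
const secondRun: ExampleInput = { question: "What is 2+2?", requestId: "run-2" };

// The request ID differs, so whole-input matching treats these as different.
console.log(wholeInputKey(firstRun) === wholeInputKey(secondRun)); // false

// Keying on the question alone treats them as the same test case.
console.log(questionKey(firstRun) === questionKey(secondRun)); // true
```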

Experiment view layouts

Grid layout

When you run multiple experiments, you can also compare experiment outputs side-by-side in the table by selecting the Grid layout. In the grid layout, select which fields to display in cells by selecting from the Fields dropdown menu.

Summary layout

The Summary layout summarizes scores and metrics across the base experiment and all comparison experiments, in a reporting-friendly format with large type. Both summary and grid layouts respect all view filters.

Aggregate (weighted) scores

Experiments often compute many scores, sometimes hundreds. When reporting on an experiment or comparing experiments over time, it helps to have a single score that represents the experiment as a whole.

Braintrust allows you to do this with aggregate scores, which are formulas that combine multiple scores. To create an aggregate score, go to your project's Configuration page, and select Add aggregate score.

Braintrust currently supports three types of aggregate scores:

Weighted average - A weighted average of selected scores.
Minimum - The minimum value among the selected scores.
Maximum - The maximum value among the selected scores.
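
The three aggregation types can be sketched as plain functions over a row's scores. The score names, weights, and helper functions below are hypothetical, and normalizing by the sum of the weights is an assumption about how the weighted average is defined.

```typescript
type Scores = Record<string, number>;

// Weighted average of the selected scores.
// Assumption: the result is normalized by the sum of the weights used.
function weightedAverage(scores: Scores, weights: Record<string, number>): number {
  let total = 0;
  let weightSum = 0;
  for (const name of Object.keys(weights)) {
    const value: number | undefined = scores[name];
    const weight: number | undefined = weights[name];
    if (value === undefined || weight === undefined) continue;
    total += value * weight;
    weightSum += weight;
  }
  if (weightSum === 0) throw new Error("no weighted scores available");
  return total / weightSum;
}

// Collect the values of the selected scores, skipping any that are missing.
function selectedValues(scores: Scores, names: string[]): number[] {
  const values: number[] = [];
  for (const name of names) {
    const value: number | undefined = scores[name];
    if (value !== undefined) values.push(value);
  }
  return values;
}

// Minimum and maximum value among the selected scores.
const minimumScore = (scores: Scores, names: string[]): number =>
  Math.min(...selectedValues(scores, names));
const maximumScore = (scores: Scores, names: string[]): number =>
  Math.max(...selectedValues(scores, names));

const rowScores: Scores = { accuracy: 0.5, relevance: 0.25, safety: 1 };
const selected = ["accuracy", "relevance", "safety"];

// (0.5 * 2 + 0.25 * 1 + 1 * 1) / (2 + 1 + 1) = 2.25 / 4
console.log(weightedAverage(rowScores, { accuracy: 2, relevance: 1, safety: 1 })); // 0.5625
console.log(minimumScore(rowScores, selected)); // 0.25
console.log(maximumScore(rowScores, selected)); // 1
```

A minimum aggregate is useful when one failing scorer should drag down the whole result, while a weighted average lets a few important scorers dominate.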

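Export experiments

SDK

To read the data from an existing experiment with the SDK, pass open: true to init() and iterate over the returned experiment object:

import { init } from "braintrust";

async function openExperiment() {
  const experiment = init(
    "Say Hi Bot", // Replace with your project name
    {
      experiment: "my-experiment", // Replace with your experiment name
      open: true,
    },
  );
  for await (const testCase of experiment) {
    console.log(testCase);
  }
}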

You can use the asDataset()/as_dataset() function to automatically convert the experiment into the same fields you'd use in a dataset (input, expected, and metadata).

import { init } from "braintrust";
 
async function openExperiment() {
  const experiment = init(
    "Say Hi Bot", // Replace with your project name
    {
      experiment: "my-experiment", // Replace with your experiment name
      open: true,
    },
  );
 
  for await (const testCase of experiment.asDataset()) {
    console.log(testCase);
  }
}

For a more advanced overview of how to reuse experiments as datasets, see Hill climbing.
