Scorers

Scorers in Braintrust let you evaluate the output of LLMs against a set of criteria. These can include both heuristics (expressed as code) and prompts (expressed as LLM-as-a-judge). Scorers assign a performance score between 0% and 100% that reflects how well the AI outputs match expected results. While many scorers are available out of the box in Braintrust, you can also create your own custom scorers directly in the UI or upload them via the command line. Scorers that you define in the UI can also be used as functions.

Autoevals

There are several pre-built scorers available via the open-source autoevals library, which offers standard evaluation methods that you can start using immediately.

Autoeval scorers offer a strong starting point for a variety of evaluation tasks. Some autoeval scorers require configuration before they can be used effectively. For example, you might need to define expected outputs or certain parameters for specific tasks. To edit an autoeval scorer, you must copy it first.

While autoevals are a great way to get started, you may eventually need to create your own custom scorers for more advanced use cases.

Custom scorers

You can create custom scorers in TypeScript or Python, or as an LLM-as-a-judge, through the UI by navigating to Library > Scorers and selecting Create scorer. These scorers will be available to use as functions throughout your project. You can also upload custom scorers from the command line.

TypeScript and Python scorers

For more specialized evals, you can create custom scorers in either TypeScript or Python. These code-based scorers are highly customizable and can return scores based on your exact requirements. Simply add your custom code to the TypeScript or Python tabs, and it will run in a sandboxed environment.
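
As a rough illustration of what a code-based scorer can look like, the sketch below scores an output by how many of its words overlap with the expected answer. The `ScorerArgs` shape and the names `uniqueTokens` and `tokenOverlapScorer` are hypothetical, chosen for this example only; the sketch assumes a scorer receives `output` and `expected` strings and returns a number between 0 and 1.

```typescript
// Hypothetical argument shape: assumes the scorer receives the model output
// and the expected answer as plain strings.
interface ScorerArgs {
  output: string;
  expected: string;
}

// Lowercase, split on whitespace, and drop duplicates so that casing,
// spacing, and repeated words do not affect the score.
function uniqueTokens(text: string): string[] {
  const tokens = text.toLowerCase().split(/\s+/).filter((t) => t.length > 0);
  return tokens.filter((t, i) => tokens.indexOf(t) === i);
}

// Jaccard similarity of the two token sets: shared / (total distinct tokens).
// The result always falls between 0 (no overlap) and 1 (identical token sets).
function tokenOverlapScorer({ output, expected }: ScorerArgs): number {
  const a = uniqueTokens(output);
  const b = uniqueTokens(expected);
  if (a.length === 0 && b.length === 0) return 1; // two empty strings match
  const shared = a.filter((t) => b.indexOf(t) !== -1).length;
  return shared / (a.length + b.length - shared);
}
```

A heuristic like this gives partial credit where a strict equality check would return 0, which can be useful when outputs differ only in word order or formatting.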

When you upload scorers from the command line with braintrust push, the command bundles and uploads your custom scorer functions, making them accessible across your Braintrust projects.

LLM-as-a-judge scorers

In addition to code-based scorers, you can also create LLM-as-a-judge scorers through the UI. For an LLM-as-a-judge scorer, you define a prompt that evaluates the AI's output and maps its choices to specific scores. You can also configure whether to use techniques like chain-of-thought (CoT) reasoning for more complex evaluations.
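
To make the choice-to-score mapping concrete, here is a minimal sketch of the idea. It is not how Braintrust parses model replies; `ChoiceScores`, `scoreFromChoice`, and `exampleChoiceScores` are hypothetical names, and the sketch assumes the judge answers with a single choice label such as "A" or "B".

```typescript
// Maps each choice label the judge may return to a score between 0 and 1.
type ChoiceScores = Record<string, number>;

// Hypothetical helper: convert the judge's raw reply into a score.
// Returns null when the reply does not match any configured choice,
// so that an unexpected answer is not silently counted as a 0.
function scoreFromChoice(reply: string, choiceScores: ChoiceScores): number | null {
  const choice = reply.trim();
  return Object.prototype.hasOwnProperty.call(choiceScores, choice)
    ? choiceScores[choice]
    : null;
}

// Example mapping: "A" means the output passes, "B" means it fails.
const exampleChoiceScores: ChoiceScores = { A: 1, B: 0 };
```

Intermediate values (for example, a "C" choice worth 0.5) fit the same structure if the prompt offers more than two options.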

Using a scorer in the UI

You can use both autoevals and custom scorers in the Braintrust Playground. In your playground, navigate to Scorers and select from the list of available scorers. You can also create a new custom scorer from this menu.

The Playground lets you iterate quickly on prompts while running evaluations, which makes it well suited to testing and refining your AI models and prompts.

Pushing scorers via the CLI

As with tools, when writing custom scorers in the UI, there may be restrictions on certain imports or functionality, but you can always write your scorers in your own environment and upload them for use in Braintrust. This works for both code-based scorers and LLM-as-a-judge scorers.

import braintrust from "braintrust";
import { z } from "zod";
 
const project = braintrust.projects.create({ name: "scorer" });
 
project.scorers.create({
  name: "Equality scorer",
  slug: "equality-scorer",
  description: "An equality scorer",
  parameters: z.object({
    output: z.string(),
    expected: z.string(),
  }),
  handler: async ({ output, expected }) => {
    // Strict equality avoids implicit type coercion; score 1 on a match, 0 otherwise.
    return output === expected ? 1 : 0;
  },
});
 
project.scorers.create({
  name: "Equality LLM scorer",
  slug: "equality-llm-scorer",
  description: "An equality LLM scorer",
  messages: [
    {
      role: "user",
      content:
        'Return "A" if {{output}} is equal to {{expected}}, and "B" otherwise.',
    },
  ],
  model: "gpt-4o",
  useCot: true,
  choiceScores: {
    A: 1,
    B: 0,
  },
});
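
The LLM scorer's prompt above contains `{{output}}` and `{{expected}}` placeholders. The sketch below shows one way such placeholders could be filled before the prompt reaches the judge model. It assumes plain `{{name}}` substitution; the template engine Braintrust actually uses may behave differently, and `renderTemplate` and `renderedPrompt` are hypothetical names.

```typescript
// Hypothetical helper: fill `{{name}}` placeholders from a map of values.
// Unknown placeholders are left untouched so that mistakes stay visible.
function renderTemplate(template: string, values: Record<string, string>): string {
  return template.replace(/\{\{\s*(\w+)\s*\}\}/g, (match: string, name: string) =>
    Object.prototype.hasOwnProperty.call(values, name) ? values[name] : match
  );
}

// Render the prompt from the scorer definition with sample values.
const renderedPrompt = renderTemplate(
  'Return "A" if {{output}} is equal to {{expected}}, and "B" otherwise.',
  { output: "4", expected: "four" }
);
```

Rendering a prompt by hand like this can help you check how a given output and expected value will read to the judge before you push the scorer.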

Pushing to Braintrust

Once you define a scorer, you can push it to Braintrust with braintrust push:

npx braintrust push scorer.ts

Dependencies

Braintrust will take care of bundling the dependencies your scorer needs.

In TypeScript, we use esbuild to bundle your code and its dependencies together. This works for most dependencies, but it does not support native (compiled) libraries like SQLite.

If you have trouble bundling your dependencies, let us know by filing an issue.

Syncing scorers via the SDK

Syncing scorers via the SDK is not currently supported, but this feature is coming soon.
