Evaluate via UI

The following steps require access to a Braintrust organization, which represents a company or a team. Sign up to create an organization for free.

Configure your API keys

Navigate to the AI providers page in your settings and configure at least one API key. For this quickstart, be sure to add your OpenAI API key. After completing this initial setup, you can access models from many providers through a single, unified API.
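Once a provider key is configured, that unified API is also available outside the UI. As a minimal sketch (assuming the Braintrust AI proxy endpoint shown below and a BRAINTRUST_API_KEY environment variable), you could call a configured model with the standard OpenAI client:

```python
# Minimal sketch: calling a model through the Braintrust AI proxy with the
# standard OpenAI client. Assumes BRAINTRUST_API_KEY is set in the environment
# and that the proxy endpoint below matches your Braintrust setup.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.braintrust.dev/v1/proxy",  # Braintrust AI proxy (assumed endpoint)
    api_key=os.environ["BRAINTRUST_API_KEY"],
)

response = client.chat.completions.create(
    model="gpt-4o",  # any model from a provider you have configured
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```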

Create a new project

For every AI feature your organization is building, the first thing you’ll do is create a project.

Create a new prompt

Navigate to the Prompts page to create a new prompt in your project called movie matcher. A prompt is the input you provide to the model to generate a response. Choose GPT-4o for your model, and enter this system prompt:

Based on the following description, identify the movie title. In your response, simply provide the name of the movie.

Select the + Message button below the system prompt, and enter a user message:

{{input}}

Prompts can use mustache templating syntax to refer to variables. In this case, the input corresponds to the movie description given by the user.
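To make the substitution concrete, here is an illustrative sketch (using the chevron package, a Python implementation of mustache) of the messages the model receives once {{input}} is filled in with a movie description. The Braintrust playground performs this rendering for you:

```python
# Illustrative sketch of how the prompt is rendered before it is sent to the model.
# The {{input}} placeholder in the user message is replaced with the movie
# description from the dataset row being evaluated.
import chevron  # Python implementation of mustache templating

SYSTEM_PROMPT = (
    "Based on the following description, identify the movie title. "
    "In your response, simply provide the name of the movie."
)
USER_TEMPLATE = "{{input}}"

row = {"input": "A hacker discovers reality is a simulation and joins a rebellion."}

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": chevron.render(USER_TEMPLATE, row)},
]
# messages[1]["content"] is now the movie description itself.
```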

Select Save to save your prompt.

Explore the prompt playground

Scroll to the bottom of the prompt viewer, and select Create playground with prompt. This will open the prompt you just created in the prompt playground, a tool for exploring, comparing, and evaluating prompts. In the prompt playground, you can evaluate prompts with data from your datasets.

Importing a dataset

Open this sample dataset, right-click to select Save as..., and download it. It is a .csv file with two columns, Movie Title and Original Description. Select Dataset, then Upload dataset, and upload the CSV file. Using drag and drop, map the CSV columns to dataset fields: the input field corresponds to Original Description, and the expected field to Movie Title. Then, select Import.
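If you'd rather script this step, a rough equivalent with the Braintrust Python SDK might look like the sketch below; the file path, project name, and dataset name are placeholders:

```python
# Sketch: creating the same dataset programmatically instead of uploading the CSV
# in the UI. Assumes the braintrust package is installed and BRAINTRUST_API_KEY is set.
import csv
import braintrust

dataset = braintrust.init_dataset(project="movie matcher", name="movie descriptions")

with open("movies.csv", newline="", encoding="utf-8") as f:  # hypothetical local path
    for row in csv.DictReader(f):
        dataset.insert(
            input=row["Original Description"],  # maps to the eval's input field
            expected=row["Movie Title"],        # maps to the eval's expected field
        )
```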

Choosing a scorer

A scoring function compares the expected output of a task to the actual output and produces a score between 0 and 1. Select Scorers to choose from several types of scoring functions. There are two main types: heuristics work well for well-defined criteria, while LLM-as-a-judge is better suited to more complex, subjective evaluations. You can also create a custom scorer. For this example, since there is a clear correct answer, choose ExactMatch.
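Conceptually, a heuristic scorer is just a function of the output and the expected value. The sketch below shows a simplified, case-insensitive exact-match scorer for illustration; Braintrust's built-in ExactMatch scorer may normalize values differently:

```python
# Simplified sketch of an exact-match heuristic scorer: returns 1 when the model's
# output matches the expected movie title (ignoring surrounding whitespace and case),
# and 0 otherwise.
def exact_match(output: str, expected: str) -> float:
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0

assert exact_match("Inception ", "inception") == 1.0
assert exact_match("The Matrix", "Inception") == 0.0
```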

Running your first evaluation

From within the playground, select + Experiment to set up your first evaluation. To run an eval, you need three things:

  • Data: a set of examples to test your application on
  • Task: the AI function you want to test (any function that takes in an input and returns an output)
  • Scores: a set of scoring functions that take an input, output, and optional expected value and compute a score

In this example, the Data is the dataset you uploaded, the Task is the prompt you created, and Scores is the scoring function you selected.
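The same Data/Task/Scores triple can also be expressed in code. A rough sketch using the Braintrust SDK and the autoevals library is shown below; the project and dataset names are placeholders, and it assumes the relevant API keys are set in your environment:

```python
# Sketch: the same Data / Task / Scores triple expressed with the Braintrust SDK.
# Assumes braintrust, autoevals, and openai are installed and API keys are configured.
from braintrust import Eval, init_dataset
from autoevals import ExactMatch
from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = (
    "Based on the following description, identify the movie title. "
    "In your response, simply provide the name of the movie."
)

def movie_matcher(input: str) -> str:
    # Task: the prompt from above, applied to one movie description.
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": input},
        ],
    )
    return response.choices[0].message.content

Eval(
    "movie matcher",  # project name (placeholder)
    data=init_dataset(project="movie matcher", name="movie descriptions"),  # Data
    task=movie_matcher,                                                     # Task
    scores=[ExactMatch],                                                    # Scores
)
```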

Creating an experiment from the playground will automatically log your results to Braintrust.

Interpreting your results

Navigate to the Experiments page to view your evaluation. Examine the exact match scores and other feedback generated by your evals. If you notice that some of your outputs did not match what was expected, you can tweak your prompt directly in the UI until it consistently produces high-quality outputs. If changing the prompt doesn't yield the desired results, consider experimenting with different models.

As you iterate on your prompt, you can run more experiments and compare results.
