Evaluate via UI
The following steps require access to a Braintrust organization, which represents a company or a team. Sign up to create an organization for free.
Configure your API keys
Navigate to the AI providers page in your settings and configure at least one API key. For this quickstart, be sure to add your OpenAI API key. After completing this initial setup, you can access models from many providers through a single, unified API.
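That unified API can also be reached from code. Here is a minimal sketch, assuming the proxy endpoint URL shown below and a `BRAINTRUST_API_KEY` environment variable (both are illustrative assumptions; the rest of this quickstart only needs the keys you configured in the UI):

```typescript
// Minimal sketch: point a standard OpenAI client at Braintrust's AI proxy
// (endpoint URL assumed) and authenticate with a Braintrust API key to call
// models from any provider you configured above.
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://api.braintrust.dev/v1/proxy", // assumed proxy endpoint
  apiKey: process.env.BRAINTRUST_API_KEY,
});

const response = await client.chat.completions.create({
  model: "gpt-4o", // model names from other configured providers work the same way
  messages: [{ role: "user", content: "Hello!" }],
});
console.log(response.choices[0].message.content);
```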
For more advanced use cases where you want to use custom models or avoid plugging your API key into Braintrust, you may want to check out the SDK quickstart.
Create a new project
For every AI feature your organization is building, the first thing you'll do is create a project.
Create a new prompt
Navigate to Library in the top menu bar, then select Prompts. Create a new prompt in your project called "movie matcher". A prompt is the input you provide to the model to generate a response. Choose GPT-4o as your model, and enter a system prompt.
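The exact wording is up to you; anything that tells the model to answer with only the movie title will work, for example:

```
You are a movie expert. Given a description of a movie, respond with only the title of the movie being described, and nothing else.
```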
Select the + Message button below the system prompt, and enter a user message that passes the movie description through a template variable:
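```
{{input}}
```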
Prompts can use mustache templating syntax to refer to variables. In this case, the input corresponds to the movie description given by the user.
Select Save as custom prompt to save your prompt.
Explore the prompt playground
Scroll to the bottom of the prompt viewer and select Create playground with prompt. This opens the prompt you just created in the prompt playground, a tool for exploring, comparing, and evaluating prompts against data from your datasets.
Importing a dataset
Open this sample dataset, then right-click and select Save as... to download it. It is a .csv file with two columns, Movie Title and Original Description. Inside your playground, select Dataset, then Upload dataset, and upload the CSV file. Using drag and drop, map the CSV columns to dataset fields: input corresponds to Original Description, and expected corresponds to Movie Title. Then, select Import.
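For reference, the CSV has roughly this shape (the rows below are illustrative, not the actual dataset contents):

```csv
Movie Title,Original Description
Jaws,"A giant great white shark terrorizes a New England beach town."
The Matrix,"A computer hacker learns that reality is a simulation."
```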
Choosing a scorer
A scoring function allows you to compare the expected output of a task to the actual output and produce a score between 0 and 1. Inside your playground, select Scorers to choose from several types of scoring functions. There are two main types of scoring functions: heuristics are great for well-defined criteria, while LLM-as-a-judge is better for handling more complex, subjective evaluations. You can also create a custom scorer. For this example, since there is a clear correct answer, we can choose ExactMatch.
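Conceptually, a heuristic scorer like ExactMatch is just a function of the input, output, and expected value that returns a number between 0 and 1. A minimal TypeScript sketch of the idea (illustrative, not Braintrust's implementation):

```typescript
// Illustrative exact-match scorer: compare the model output to the expected
// value and return a score between 0 and 1.
interface ScorerArgs {
  input: string;
  output: string;
  expected?: string;
}

function exactMatch({ output, expected }: ScorerArgs): number {
  // Score 1 only when the output matches the expected movie title
  // (ignoring surrounding whitespace); otherwise score 0.
  return expected !== undefined && output.trim() === expected.trim() ? 1 : 0;
}
```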
Running your first evaluation
From within the playground, select + Experiment to set up your first evaluation. To run an eval, you need three things:
- Data: a set of examples to test your application on
- Task: the AI function you want to test (any function that takes in an input and returns an output)
- Scores: a set of scoring functions that take an input, output, and optional expected value and compute a score
In this example, the Data is the dataset you uploaded, the Task is the prompt you created, and the Scores are the scoring function you selected.
Creating an experiment from the playground will automatically log your results to Braintrust.
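The same Data/Task/Scores triple carries over directly to the SDK if you later want to run evals from code. A hedged sketch, assuming the braintrust, autoevals, and openai packages, an ExactMatch export from autoevals, and an illustrative inline example in place of the uploaded dataset:

```typescript
// Hedged sketch of the same eval expressed as a script. Assumes OPENAI_API_KEY
// and BRAINTRUST_API_KEY are set in the environment.
import { Eval } from "braintrust";
import { ExactMatch } from "autoevals";
import OpenAI from "openai";

const client = new OpenAI();

// The first argument is the project name (illustrative here).
Eval("movie matcher", {
  // Data: one inline example standing in for the uploaded dataset.
  data: () => [
    {
      input: "A giant great white shark terrorizes a New England beach town.",
      expected: "Jaws",
    },
  ],
  // Task: the same system/user prompt structure you built in the UI.
  task: async (input: string) => {
    const completion = await client.chat.completions.create({
      model: "gpt-4o",
      messages: [
        {
          role: "system",
          content: "Respond with only the title of the movie being described.",
        },
        { role: "user", content: input },
      ],
    });
    return completion.choices[0].message.content ?? "";
  },
  // Scores: the exact-match scorer compares output against expected.
  scores: [ExactMatch],
});
```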
Interpreting your results
Navigate to the Experiments page to view your evaluation. Examine the exact match scores and other feedback generated by your evals. If you notice that some of your outputs did not match what was expected, you can tweak your prompt directly in the UI until it consistently produces high-quality outputs. If changing the prompt doesn't yield the desired results, consider experimenting with different models.
As you iterate on your prompt, you can run more experiments and compare results.
Next steps
- Now that you've run your first evaluation, learn how to write your own eval script.
- Check out more examples and sample projects in the Braintrust Cookbook.
- Explore the guides to read more about evals, logging, and datasets.