Evaluating SimpleQA
We're going to evaluate a simple QA system in Braintrust using SimpleQA, an open-source dataset from OpenAI. We'll also use autoevals, our built-in library for evaluating AI model outputs. By the time you finish this example, you'll know how to define and use custom evaluation metrics, compare evals that use different models, and analyze results in Braintrust.
Setup
Before getting started, make sure you have a Braintrust account and an OpenAI API key. Add the OpenAI key to your Braintrust account's AI providers configuration and create a BRAINTRUST_API_KEY. In this cookbook, we'll compare GPT-4o to Claude 3.5 Sonnet, so if you'd like to follow along, add an Anthropic API key to your Braintrust account as well. You can also add an API key for any other AI provider and follow the same process. Lastly, add your `BRAINTRUST_API_KEY` to your Python environment, or just hardcode it into the code below.
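For example, you could set it in the notebook like this (a minimal sketch; replace the placeholder with your own key, or export the variable in your shell instead of hardcoding it):

```python
import os

# Replace the placeholder with your own key, or export BRAINTRUST_API_KEY in
# your shell instead of hardcoding it here.
os.environ["BRAINTRUST_API_KEY"] = "YOUR_BRAINTRUST_API_KEY"
```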
Install dependencies
Everything you need to run evals is readily available through Braintrust. We'll use the AI proxy to access multiple AI models without having to write model-specific code. Run the following command to install required libraries.
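If you're following along in a notebook, a command along these lines installs them (the package list assumes the standard `braintrust`, `autoevals`, and `openai` distributions, plus `requests`, which we'll use below to fetch the dataset):

```python
%pip install braintrust autoevals openai requests
```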
Preparing the dataset
We'll use the SimpleQA dataset, which is hosted online as a CSV file. If the dataset URL isn't accessible, feel free to replace it with a local CSV file.
First, we'll load the dataset and print a quick confirmation that we're ready for the next step.
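Here's a sketch of that step. The URL below points at OpenAI's published SimpleQA test set and is an assumption; swap in a local CSV path if it isn't reachable:

```python
import csv
import io

import requests

# Assumed location of the SimpleQA test set; replace with a local CSV if needed.
DATASET_URL = "https://openaipublic.blob.core.windows.net/simple-evals/simple_qa_test_set.csv"

response = requests.get(DATASET_URL)
response.raise_for_status()
raw_rows = list(csv.DictReader(io.StringIO(response.text)))

print(f"Loaded {len(raw_rows)} rows from SimpleQA")
```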
Parse and transform the dataset
Next, we'll parse the raw CSV data into a Python list of dictionaries, ensuring that any metadata stored as strings is converted into usable Python objects. This transformation prepares the dataset for evaluation tasks. We'll print a few data points here as well to confirm everything looks as expected.
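A sketch of that transformation, assuming the CSV has `metadata`, `problem`, and `answer` columns and that the metadata is stored as a Python-literal string:

```python
import ast

parsed_rows = []
for row in raw_rows:
    parsed_rows.append(
        {
            "problem": row["problem"],
            "answer": row["answer"],
            # The metadata column is a string like "{'topic': ..., 'answer_type': ...}",
            # so parse it into a real dict.
            "metadata": ast.literal_eval(row["metadata"]),
        }
    )

print(parsed_rows[:3])
```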
Format the data
Lastly, we need to format the data for Braintrust. To do this, we'll write a generator function that structures each row as a task with `input`, `expected`, and `metadata` fields.
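A minimal version of that generator might look like the following; the 10-row cap is an assumption to keep the example fast and cheap:

```python
def load_data(limit=10):
    # Yield Braintrust eval cases with input, expected, and metadata fields.
    for row in parsed_rows[:limit]:
        yield {
            "input": row["problem"],
            "expected": row["answer"],
            "metadata": row["metadata"],
        }
```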
Define the model task
Now that our data is ready, we'll generate responses to the QA tasks using an LLM call. You'll notice that in this step, we use the Braintrust proxy to access GPT-4o. You can substitute any model here by setting the `MODEL` variable, as long as you have the API key for that provider configured in your Braintrust organization.
Here is the task definition:
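The sketch below assumes you call the model through the Braintrust AI proxy, authenticating with your `BRAINTRUST_API_KEY`; the system prompt is an illustrative choice, not part of SimpleQA itself.

```python
import os

from openai import OpenAI

MODEL = "gpt-4o"

# The Braintrust AI proxy exposes many providers behind one OpenAI-compatible
# endpoint, authenticated with your Braintrust API key.
client = OpenAI(
    base_url="https://api.braintrust.dev/v1/proxy",
    api_key=os.environ["BRAINTRUST_API_KEY"],
)


def task(input):
    # Ask the model to answer the SimpleQA question directly.
    response = client.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "system", "content": "Answer the question as concisely as you can."},
            {"role": "user", "content": input},
        ],
    )
    return response.choices[0].message.content
```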
Create a scoring function
To assess the performance of our QA system, we'll define a custom LLM-as-a-judge scoring function using the `LLMClassifier` from autoevals as a starting point. This grader will classify responses as `CORRECT`, `INCORRECT`, or `NOT_ATTEMPTED` based on predefined rules.
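Here's one way to set this up with autoevals' `LLMClassifier`; the grading prompt below is a simplified stand-in for SimpleQA's full grading rules:

```python
from autoevals import LLMClassifier

# Simplified grading prompt; the official SimpleQA grader prompt is more detailed.
GRADER_PROMPT = """\
You are grading an answer to a question.

Question: {{input}}
Gold answer: {{expected}}
Submitted answer: {{output}}

Grade the submitted answer as:
- CORRECT: it contains the gold answer's information and does not contradict it
- INCORRECT: it contradicts the gold answer
- NOT_ATTEMPTED: it neither confirms nor contradicts the gold answer

Answer with CORRECT, INCORRECT, or NOT_ATTEMPTED."""

grader = LLMClassifier(
    name="Grader",
    prompt_template=GRADER_PROMPT,
    choice_scores={"CORRECT": 1, "INCORRECT": 0, "NOT_ATTEMPTED": 0},
    use_cot=True,
)
```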
Run the evaluation
With the dataset, scoring function, and task defined, we're ready to run our eval:
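Putting the pieces together, the call might look like this (the project name `SimpleQA` and the experiment name are assumptions you can change):

```python
from braintrust import Eval

Eval(
    "SimpleQA",
    data=load_data,
    task=task,
    scores=[grader],
    experiment_name=MODEL,
)
```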
Analyze results
Braintrust will print a summary of your eval, but to analyze the full results, you'll need to visit the Braintrust dashboard by opening the printed link, or by navigating to Braintrust, selecting the SimpleQA project, and opening the Evaluations tab.
If you look at the score distribution chart, you’ll notice that the Grader gave each datapoint a score of either 100% or 0%, averaging out to 50% across the 10 datapoints.
Comparing models
Let's swap out the model and see if we get different results. Set the `MODEL` variable to `claude-3-5-sonnet-latest` and rerun the evaluation cell above. Now when you go to Braintrust, you can directly compare the results of the two experiments.
While the new model scored better on some of the datapoints, it regressed on others.
Next steps
From here, there are a few different things you could do to improve the score of your QA system. You could:
- Switch out the model again and see if you get different results
- Dig into the traces in Braintrust and examine if the scoring function is working as intended
- Edit the scoring function
- Run the experiment on a larger dataset
The way we’ve set up the experiment here makes it easy to switch out the LLM and compare results across models, examine your evaluation more thoroughly in the UI, and add more data points to your evaluation dataset. Give it a try!