We're going to evaluate a simple QA system in Braintrust using SimpleQA, an open-source dataset from OpenAI. We'll also use autoevals, our built-in library for evaluating AI model outputs. By the end of this example, you'll know how to define and use custom evaluation metrics, compare evals that use different models, and analyze results in Braintrust.
Before getting started, make sure you have a Braintrust account and an OpenAI API key. Add the OpenAI key to your Braintrust account's AI providers configuration and create a BRAINTRUST_API_KEY. In this cookbook, we'll be comparing GPT-4o to Claude 3.5 Sonnet, so if you'd like to follow along, add an Anthropic API key to your Braintrust account as well. Or, add an API key for any other AI provider you'd like and follow the same process. Lastly, add your BRAINTRUST_API_KEY to your Python environment, or just hardcode it into the code below.
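If you'd rather set the key inside your Python session than in your shell, a one-liner like this works (the placeholder value is yours to replace with a key from the Braintrust dashboard):

```python
import os

# Replace the placeholder with your own BRAINTRUST_API_KEY.
os.environ["BRAINTRUST_API_KEY"] = "<your-braintrust-api-key>"
```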
Everything you need to run evals is readily available through Braintrust. We'll use the AI proxy to access multiple AI models without having to write model-specific code. Run the following command to install the required libraries.
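The exact package list below is an assumption, but `braintrust`, `autoevals`, `openai`, and `requests` cover everything used in this cookbook:

```bash
pip install braintrust autoevals openai requests
```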
Next, we'll parse the raw CSV data into a Python list of dictionaries, ensuring that any metadata stored as strings is converted into usable Python objects. This transformation prepares the dataset for evaluation tasks. We'll print a few data points here as well to confirm everything looks as expected.
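Here's a sketch of the parsing step. The dataset URL and the column names (`metadata`, `problem`, `answer`) are assumptions about where and how OpenAI publishes the SimpleQA test set; the `metadata` column is stored as a Python-literal string, so `ast.literal_eval` turns it into a real dict.

```python
import ast
import csv
import io

import requests

# Assumed location of the SimpleQA test set; adjust if the dataset moves.
SIMPLEQA_CSV_URL = (
    "https://openaipublic.blob.core.windows.net/simple-evals/simple_qa_test_set.csv"
)

response = requests.get(SIMPLEQA_CSV_URL)
response.raise_for_status()

rows = []
for row in csv.DictReader(io.StringIO(response.text)):
    # The metadata column looks like "{'topic': ..., 'answer_type': ...}",
    # so parse it into a usable Python dict.
    row["metadata"] = ast.literal_eval(row["metadata"])
    rows.append(row)

# Spot-check a few datapoints.
print(rows[:3])
```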
Lastly, we need to format the data for Braintrust. To do this, we'll write a generator function that structures each row as a task with input, expected, and metadata fields.
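A generator along these lines does the trick, assuming the `problem` and `answer` columns from the parsing step above. Capping it at 10 rows keeps the first run small and cheap:

```python
from itertools import islice

def load_dataset(limit=10):
    # Yield one eval case per row in the shape Braintrust expects:
    # the question as input, the reference answer as expected, plus the row's metadata.
    for row in islice(rows, limit):
        yield {
            "input": row["problem"],
            "expected": row["answer"],
            "metadata": row["metadata"],
        }
```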
Now that our data is ready, we'll generate responses to the QA tasks using an LLM call. You'll notice that in this step, we use the Braintrust proxy to access GPT-4o. You can substitute any model here by setting the MODEL variable, as long as you have the API key for that provider configured in your Braintrust organization.
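Here's a sketch of the task function. The proxy base URL and the concise-answer system prompt are assumptions about the setup; the key points are that your BRAINTRUST_API_KEY authenticates the request and the MODEL variable selects the underlying provider.

```python
import os

from openai import OpenAI

MODEL = "gpt-4o"

# Route requests through the Braintrust AI proxy so any configured provider
# can be used with the same OpenAI-compatible client.
client = OpenAI(
    base_url="https://api.braintrust.dev/v1/proxy",
    api_key=os.environ["BRAINTRUST_API_KEY"],
)

def task(input):
    response = client.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "system", "content": "Answer the question as concisely as possible."},
            {"role": "user", "content": input},
        ],
    )
    return response.choices[0].message.content
```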
To assess the performance of our QA system, we'll define a custom LLM-as-a-judge scoring function using the LLMClassifier from autoevals as a starting point. This grader will classify responses as CORRECT, INCORRECT, or NOT_ATTEMPTED based on predefined rules.
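As a sketch, the grader and the eval run might look like the following. The prompt here is a heavily abbreviated stand-in for SimpleQA's full grading rubric, CORRECT maps to a score of 1 while the other two grades map to 0, and the experiment is named after the model so different runs are easy to compare.

```python
from autoevals import LLMClassifier
from braintrust import Eval

# An abbreviated version of SimpleQA-style grading rules; the full rubric
# includes detailed examples of each grade.
grader = LLMClassifier(
    name="Grader",
    prompt_template=(
        "You are grading an answer to a question against a gold target.\n"
        "Question: {{input}}\n"
        "Gold target: {{expected}}\n"
        "Predicted answer: {{output}}\n"
        "Grade the predicted answer as CORRECT, INCORRECT, or NOT_ATTEMPTED."
    ),
    choice_scores={"CORRECT": 1, "INCORRECT": 0, "NOT_ATTEMPTED": 0},
    use_cot=True,
)

# Run the eval: each dataset row is passed to task(), and the grader scores the output.
Eval(
    "SimpleQA",
    data=load_dataset,
    task=task,
    scores=[grader],
    experiment_name=MODEL,
)
```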
Braintrust will print a summary of your eval, but to analyze the full results, visit the Braintrust dashboard: open the printed link, or go to Braintrust, select the SimpleQA project, and open the Evaluations tab.
If you look at the score distribution chart, you'll notice that the Grader gave each datapoint a score of either 100% or 0%, averaging out to 50% across the 10 datapoints.
Let's swap out the model and see if we get different results. Set the MODEL variable to claude-3-5-sonnet-latest and rerun the evaluation cell above. Now when you go to Braintrust, you can directly compare the results of the experiments.
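Concretely, assuming the variables from the earlier cells, the only code change is the model name before rerunning the eval:

```python
# Requires an Anthropic API key configured in your Braintrust organization.
MODEL = "claude-3-5-sonnet-latest"
```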
While the new model scored better on some of the datapoints, it regressed on others.
From here, there are a few different things you could do to improve the score of your QA system. You could:
- Switch out the model again and see if you get different results
- Dig into the traces in Braintrust and examine whether the scoring function is working as intended
- Edit the scoring function
- Run the experiment on a larger dataset
The way we’ve set up the experiment here makes it easy to switch out the LLM and compare results across models, examine your evaluation more thoroughly in the UI, and add more data points to your evaluation dataset. Give it a try!