Comparing evals across multiple AI models
This tutorial will teach you how to use Braintrust to compare the same prompts across different AI models and parameters, to help you choose the best model for your AI application.
Before starting, please make sure that you have a Braintrust account. If you do not, please sign up. After this tutorial, feel free to dig deeper by visiting the docs.
Installing dependencies
To see a list of dependencies, you can view the accompanying package.json file. Feel free to copy/paste snippets of this code to run in your environment, or use tslab to run the tutorial in a Jupyter notebook.
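As a quick sketch (the package names are the standard Braintrust, autoevals, and OpenAI npm packages we assume are listed in the accompanying package.json), these are the imports used throughout the rest of the tutorial:

```typescript
// Imports used throughout the tutorial. Package names are the standard
// braintrust / autoevals / openai npm packages assumed to be listed in the
// accompanying package.json.
import { Eval, wrapOpenAI } from "braintrust";
import { Levenshtein } from "autoevals";
import { OpenAI } from "openai";
```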
Setting up the data
For this example, we will use a small subset of data taken from the google/boolq dataset. If you'd like, you can try datasets and prompts from any of the other cookbooks at Braintrust.
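As a minimal sketch of the data shape (the rows and field names below are illustrative stand-ins rather than the exact subset used here), each example pairs a yes/no question and its supporting passage with the expected answer:

```typescript
// An illustrative, hand-picked slice in the shape of google/boolq rows:
// a yes/no question, the passage that answers it, and the expected answer.
interface BoolQExample {
  input: { question: string; passage: string };
  expected: "yes" | "no";
}

const dataset: BoolQExample[] = [
  {
    input: {
      question:
        "is harry potter and the escape from gringotts a roller coaster ride",
      passage:
        "Harry Potter and the Escape from Gringotts is an indoor steel roller coaster at Universal Studios Florida...",
    },
    expected: "yes",
  },
  // ...more rows taken from the boolq subset
];
```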
Running comparison evals across multiple models
Let's set up some code to compare these prompts and inputs across four different models and a range of temperature values. For this cookbook, we will use Braintrust's LLM proxy to access the APIs of the different models.
All we need to do is point the client's `baseURL` at the proxy, provide the relevant API key, and then use the `wrapOpenAI` function from braintrust, which captures helpful debugging information about each model's performance while keeping the same SDK interface across all models.
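Here is a minimal sketch of that setup, assuming the documented Braintrust proxy endpoint and a `BRAINTRUST_API_KEY` environment variable (a provider key routed through the proxy works as well):

```typescript
import { wrapOpenAI } from "braintrust";
import { OpenAI } from "openai";

// One OpenAI-style client for every model: requests go through the
// Braintrust proxy, which routes them to the right provider based on the
// model name, and wrapOpenAI logs each call for debugging.
const client = wrapOpenAI(
  new OpenAI({
    baseURL: "https://api.braintrust.dev/v1/proxy",
    apiKey: process.env.BRAINTRUST_API_KEY,
  })
);
```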
Then we will set up our eval data for each combination of model, prompt and temperature.
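As a sketch of what that could look like (the specific model identifiers, prompt variables, and temperature values below are placeholders, not the cookbook's exact choices), we can enumerate every combination up front:

```typescript
// Placeholder model names, prompts, and temperatures; the cookbook uses
// 4 models, 3 prompts of increasing length, and 5 temperature values
// (60 combinations in total).
const models = [
  "gpt-4o",
  "gpt-4",
  "claude-3-5-sonnet-20240620",
  "claude-3-haiku-20240307",
];
const temperatures = [0, 0.25, 0.5, 0.75, 1];
const prompts = [shortPrompt, mediumPrompt, longPrompt]; // hypothetical prompt strings defined elsewhere

// Every combination of model, temperature, and prompt becomes one experiment.
const runs = models.flatMap((model) =>
  temperatures.flatMap((temperature) =>
    prompts.map((prompt) => ({ model, temperature, prompt }))
  )
);
```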
Let's use the functions and data that we have set up to run some evals on Braintrust! We will be using two scorers for this eval:
- A simple exact match scorer that will compare the output from the LLM exactly with the expected value
- A Levenshtein scorer which will calculate the Levenshtein distance between the LLM output and our expected value
We also add the model, temperature, and prompt to the metadata so that we can use those fields for visualization inside the Braintrust app after the evals finish running.
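Below is a minimal sketch of the eval loop, assuming the `runs`, `dataset`, and `client` values sketched above plus a hypothetical `completion()` helper that calls the proxied client with a given model, prompt, and temperature. It also assumes the `experimentName` and experiment-level `metadata` options of `Eval`:

```typescript
import { Eval } from "braintrust";
import { Levenshtein } from "autoevals";

// A simple exact-match scorer: 1 if the output equals the expected value.
const ExactMatch = ({
  output,
  expected,
}: {
  output: string;
  expected?: string;
}) => ({
  name: "ExactMatch",
  score: output === expected ? 1 : 0,
});

for (const { model, temperature, prompt } of runs) {
  await Eval("Model comparison", {
    experimentName: `${model}-temp-${temperature}`,
    data: () => dataset,
    // completion() is a hypothetical helper that formats the prompt and
    // calls the proxied client for this model/temperature combination.
    task: async (input) => completion(client, { model, temperature, prompt, input }),
    scores: [ExactMatch, Levenshtein],
    // These fields power the X axis / color / symbol groupings in the UI.
    metadata: { model, temperature, prompt },
  });
}
```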
Visualizing
Now we have successfully run our evals! Let's log onto braintrust.dev and take a look at the results.
Click into the newly generated project called Model comparison and check it out! You should notice a few things:
- Each line represents a score over time, and each data point represents an experiment that was run.
- From the code, we ran 60 experiments (5 temperature values x 4 models x 3 prompts) so one line should consist of 60 dots, each with a different combination of temperature, model, and prompt.
- Metadata fields are automatically populated as viable X axis values.
- Metadata fields with numeric values are automatically populated as viable Y axis values.
Diving in
This chart also lets us group the data so we can compare experiment runs by model, prompt, and temperature.
By selecting X Axis `prompt`, we can see pretty clearly that the longer prompt performed better than the shorter ones.
By selecting one color per model and X Axis `model`, we can also visualize performance across models. From this view, we can see that the OpenAI models outperformed the Anthropic models.
Let's see if we can find any differences between the OpenAI models by selecting one color per model, one symbol per prompt, and X Axis `temperature`.
In this view, we can see that `gpt-4` performed better than `gpt-4o` at higher temperatures!
Parting thoughts
This is just the start of evaluating and improving your AI applications. From here, you should run more experiments with larger datasets and try out different prompts. Once you have run another set of experiments, come back to the chart and play with the different views and groupings. You can also filter for experiments with specific scores and metadata to uncover even more insights.
Happy evaluating!