AI Search Bar
This guide demonstrates how we developed Braintrust's AI-powered search bar, harnessing the power of Braintrust's evaluation workflow along the way. If you've used Braintrust before, you may be familiar with the project page, which serves as a home base for collections of eval experiments:
To find a particular experiment, you can type filter and sort queries into the search bar using standard SQL syntax. But SQL can be finicky -- it's very easy to run into syntax errors like single quotes instead of double, incorrect JSON extraction syntax, or typos. Users would prefer to just type in an intuitive search like "experiments run on git commit 2a43fd1" or "score under 0.5" and see a corresponding SQL query appear automatically. Let's achieve this using AI, with assistance from Braintrust's eval framework.
We'll start by installing some packages and setting up our OpenAI client.
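As a rough sketch, the setup might look like this -- the package list and environment variable are assumptions, so adjust them to match your environment:

```python
# Assumed packages for this guide:
#   pip install openai braintrust pydantic duckdb

import os
from openai import OpenAI

# Assumes OPENAI_API_KEY is set in the environment.
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
```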
Load the data and render the templates
When we ask GPT to translate a search query, we have to account for multiple output options: (1) a SQL filter, (2) a SQL sort, (3) both of the above, or (4) an unsuccessful translation (e.g. for a nonsensical user input). We'll use function calling to robustly handle each distinct scenario, with the following output format:
- match: Whether or not the model was able to translate the search into a valid SQL filter/sort.
- filter: A WHERE clause.
- sort: An ORDER BY clause.
- explanation: An explanation for the choices above -- this is useful for debugging and evaluation.
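To make this concrete, a function-calling definition along these lines would capture that output format (the function name and descriptions are illustrative, not the exact production definition):

```python
# Hypothetical function-calling schema matching the output format described above.
QUERY_FUNCTION = {
    "name": "build_query",  # illustrative name
    "description": "Translate a search into a SQL filter and/or sort over experiments.",
    "parameters": {
        "type": "object",
        "properties": {
            "match": {
                "type": "boolean",
                "description": "Whether the search was translated into a valid SQL filter/sort.",
            },
            "filter": {"type": "string", "description": "A SQL WHERE clause."},
            "sort": {"type": "string", "description": "A SQL ORDER BY clause."},
            "explanation": {"type": "string", "description": "Why these choices were made."},
        },
        "required": ["match", "explanation"],
    },
}
```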
Prepare prompts for evaluation in Braintrust
Let's evaluate two different prompts: a shorter prompt with a brief explanation of the problem statement and description of the experiment schema, and a longer prompt that additionally contains a feed of example cases to guide the model. There's nothing special about either of these prompts, and that's OK -- we can iterate and improve the prompts when we use Braintrust to drill down into the results.
One detail worth mentioning: each prompt contains a stub for dynamic insertion of the data schema. This is motivated by the need to handle semantic searches like "more than 40 examples" or "score < 0.5" that don't directly reference a column in the base table. We need to tell the model how the data is structured and what each field actually means. We'll construct a descriptive schema using pydantic and paste it into each prompt to provide the model with this information.
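For illustration, a pydantic model along these lines could describe the experiment data, with placeholder prompts showing where the schema stub goes. The field names, descriptions, and prompt wording below are all assumptions, not the production versions:

```python
import json
from pydantic import BaseModel, Field

class Experiment(BaseModel):
    # Hypothetical fields -- substitute the real experiment schema.
    name: str = Field(description="Name of the experiment")
    last_updated: str = Field(description="Timestamp of the most recent update")
    creator: dict = Field(description="Object containing the creator's 'name' and 'email'")
    metadata: dict = Field(description="Arbitrary key/value metadata, e.g. the git commit")
    scores: dict = Field(description="Mapping from score name (e.g. 'accuracy') to a value between 0 and 1")
    num_examples: int = Field(description="Number of examples in the experiment")

# Serialize the schema (pydantic v2) so it can be pasted into each prompt's stub.
SCHEMA = json.dumps(Experiment.model_json_schema(), indent=2)

# Placeholder prompts -- the real short/long prompts described above are more detailed.
SHORT_PROMPT = (
    "Translate the user's search over experiments into a SQL filter and/or sort "
    "by calling the provided function. The experiment schema is:\n{schema}"
)
LONG_PROMPT = SHORT_PROMPT + "\n\nHere are some example searches and their translations:\n..."  # examples elided
```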
Our prompts are ready! Before we run our evals, we just need to load some sample data and define our scoring functions.
Load sample data
Let's load our examples. Each example case contains input (the search query) and expected (the function call output).
Let's also split the examples into a training set and test set. For now, this won't matter, but later on when we fine-tune the model, we'll want to use the test set to evaluate the model's performance.
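In outline, loading and splitting could look something like this (the file name and split ratio are assumptions):

```python
import json
import random

# Each record is assumed to look like {"input": "...", "expected": {"match": ..., "filter": ..., ...}}.
with open("examples.json") as f:  # hypothetical file name
    examples = json.load(f)

random.seed(42)
random.shuffle(examples)

split = int(len(examples) * 0.8)  # assumed 80/20 split
train_examples, test_examples = examples[:split], examples[split:]
print(f"{len(train_examples)} train / {len(test_examples)} test")
```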
Insert our examples into a Braintrust dataset so we can introspect and reuse the data later.
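With the braintrust SDK, that might look roughly like the following (the project and dataset names are placeholders):

```python
import braintrust

# Placeholder project/dataset names.
dataset = braintrust.init_dataset(project="AI Search Cookbook", name="examples")
for ex in examples:
    dataset.insert(input=ex["input"], expected=ex["expected"])
dataset.flush()
```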
Define scoring functions
How do we score our outputs against the ground truth queries? We can't rely on an exact text match, since there are multiple correct ways to express the same query in SQL. Instead, we'll use two approximate scoring methods: (1) SQLScorer, which roundtrips each query through json_serialize_sql to normalize it before attempting a direct comparison, and (2) AutoScorer, which delegates the scoring task to gpt-4.
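Simplified sketches of both scorers are below, written as plain functions rather than the SQLScorer/AutoScorer classes named above. They assume each clause gets wrapped in a full SELECT so that DuckDB's json_serialize_sql can parse it, and that scorers receive input, output, and expected arguments; the real implementations may differ.

```python
import json
import duckdb

def normalize(clause, kind):
    """Roundtrip a clause through DuckDB's json_serialize_sql to normalize its formatting."""
    if not clause:
        return None
    query = (
        f"SELECT * FROM experiments WHERE {clause}"
        if kind == "filter"
        else f"SELECT * FROM experiments ORDER BY {clause}"
    )
    return duckdb.connect().execute("SELECT json_serialize_sql(?)", [query]).fetchone()[0]

def sql_scorer(input, output, expected):
    """Fraction of (filter, sort) clauses whose normalized form matches the ground truth."""
    matches = [
        normalize(output.get(kind), kind) == normalize(expected.get(kind), kind)
        for kind in ("filter", "sort")
    ]
    return sum(matches) / len(matches)

def auto_scorer(input, output, expected):
    """Ask gpt-4 whether the generated filter/sort is equivalent to the reference."""
    resp = client.chat.completions.create(  # OpenAI client from the setup above
        model="gpt-4",
        messages=[
            {
                "role": "user",
                "content": (
                    f"Search: {input}\n"
                    f"Generated: {json.dumps(output)}\n"
                    f"Reference: {json.dumps(expected)}\n"
                    "Do the generated filter/sort mean the same thing as the reference? Answer Y or N."
                ),
            }
        ],
    )
    return 1 if resp.choices[0].message.content.strip().upper().startswith("Y") else 0
```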
Run the evals!
We'll use the Braintrust Eval framework to set up our experiments according to the prompts, dataset, and scoring functions defined above.
Let's try it on one example before running an eval.
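For instance, a task function along these lines can be spot-checked on a single search; it reuses the placeholder SHORT_PROMPT, SCHEMA, and QUERY_FUNCTION sketches from above:

```python
def run_search(query, prompt, model="gpt-3.5-turbo"):
    """Translate a natural-language search into the filter/sort output format via function calling."""
    resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": prompt.format(schema=SCHEMA)},
            {"role": "user", "content": query},
        ],
        tools=[{"type": "function", "function": QUERY_FUNCTION}],
        tool_choice={"type": "function", "function": {"name": "build_query"}},
    )
    call = resp.choices[0].message.tool_calls[0]
    return json.loads(call.function.arguments)

print(run_search("experiments run on git commit 2a43fd1", prompt=SHORT_PROMPT))
```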
We're ready to run our evals! Let's use gpt-3.5-turbo for both.
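Wiring this into the Eval framework could look roughly like this, where SHORT_PROMPT and LONG_PROMPT are the placeholder prompts from above and the project/experiment names are illustrative:

```python
from braintrust import Eval

def make_task(prompt):
    def task(input):
        return run_search(input, prompt=prompt)
    return task

for name, prompt in [("short-prompt", SHORT_PROMPT), ("long-prompt", LONG_PROMPT)]:
    Eval(
        "AI Search Cookbook",  # placeholder project name
        experiment_name=name,
        data=[{"input": ex["input"], "expected": ex["expected"]} for ex in examples],
        task=make_task(prompt),
        scores=[sql_scorer, auto_scorer],
    )
```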
View the results in Braintrust
The evals will generate a link to the experiment page. Click into an experiment to view the results!
If you've just been following along, you can check out some sample results here. Type some searches into the search bar to see AI search in action. :)
Fine-tuning
Let's try fine-tuning the model so that it works with an exceedingly short prompt. We'll use the same dataset and scoring functions, but trim the prompt down to the bare essentials. To start, let's play with one example:
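For example, we can pull one case out of the train split to see what we're working with (purely illustrative):

```python
example = train_examples[0]
print("input:   ", example["input"])
print("expected:", json.dumps(example["expected"], indent=2))
```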
Great! Now let's turn the output from the dataset into the tool call format that OpenAI expects.
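A sketch of that conversion, assuming the expected output is a dict shaped like the function arguments above (the helper name is mine):

```python
def to_tool_call_message(expected):
    """Wrap an expected output in the assistant tool-call message format used for fine-tuning."""
    return {
        "role": "assistant",
        "tool_calls": [
            {
                "id": "call_1",
                "type": "function",
                "function": {
                    "name": "build_query",  # must match the function definition the model sees
                    "arguments": json.dumps(expected),
                },
            }
        ],
    }

print(to_tool_call_message(train_examples[0]["expected"]))
```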
This function also works on our few-shot examples:
Since we're fine-tuning, we can also use a shorter prompt that just contains the object type (Experiment) and schema.
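Something like this minimal system prompt would do, with the exact wording being illustrative:

```python
FINE_TUNE_PROMPT = f"""Object type: Experiment

Schema:
{SCHEMA}
"""
```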
Let's construct messages from our train split and few-shot examples, and then fine-tune the model.
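Putting it together, the JSONL construction and fine-tuning call could look roughly like this; the file name is made up, and training only on the train split (few-shot examples can be appended the same way) is an assumption:

```python
def to_fine_tune_record(example):
    """One chat-format training record: short system prompt, user search, expected tool call."""
    return {
        "messages": [
            {"role": "system", "content": FINE_TUNE_PROMPT},
            {"role": "user", "content": example["input"]},
            to_tool_call_message(example["expected"]),
        ],
        "tools": [{"type": "function", "function": QUERY_FUNCTION}],
    }

with open("train.jsonl", "w") as f:
    for ex in train_examples:
        f.write(json.dumps(to_fine_tune_record(ex)) + "\n")

# Upload the training file and kick off the fine-tuning job.
training_file = client.files.create(file=open("train.jsonl", "rb"), purpose="fine-tune")
job = client.fine_tuning.jobs.create(training_file=training_file.id, model="gpt-3.5-turbo")
print(job.id)
```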