Text-to-SQL
This tutorial will teach you how to create an application that converts natural language questions into SQL queries, and then evaluate how well those queries work. We'll even make an improvement to the prompt and evaluate its impact! By the time you finish this tutorial, you should be ready to run your own experiments.
Before starting, please make sure that you have a Braintrust account. If you do not, please sign up.
Setting up the environment
The next few commands will install some libraries and include some helper code for the text2sql application. Feel free to copy/paste/tweak/reuse this code in your own tools.
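In a notebook, the installs might look something like the following; the exact package list is an assumption based on the libraries used later in this tutorial.

```python
# Install the libraries used in this tutorial (package list is an assumption).
%pip install braintrust autoevals duckdb pyarrow datasets openai
```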
We're going to use a public dataset called WikiSQL that contains natural language questions and their corresponding SQL queries.
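One way to load it is via the Hugging Face datasets library; the exact loading call and split used here are assumptions, not necessarily the tutorial's originals.

```python
from datasets import load_dataset

# Load WikiSQL from the Hugging Face Hub. Depending on your datasets version,
# you may need to pass trust_remote_code=True.
data = load_dataset("wikisql", split="validation")
print(len(data))
```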
Exploring the data
In this section, we'll take a look at the dataset and its ground-truth text/SQL pairs to better understand the problem and the data.
Here's an example question:
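For instance, peeking at the first row (field names follow the Hugging Face WikiSQL schema; adjust if your copy differs):

```python
# Look at one example: the natural language question and its structured query.
example = data[0]
print(example["question"])
print(example["sql"])
```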
We'll use Arrow and DuckDB to help us explore the data and run SQL queries on it:
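As a sketch, a small helper (names and field layout are assumptions) can turn a WikiSQL table into an Arrow table and register it with DuckDB so we can query it:

```python
import duckdb
import pyarrow as pa

def make_arrow_table(table):
    # Build a pyarrow Table from WikiSQL's header/rows layout.
    # Field names ("header", "rows") follow the Hugging Face schema.
    header = table["header"]
    rows = table["rows"]
    columns = {name: [row[i] for row in rows] for i, name in enumerate(header)}
    return pa.table(columns)

arrow_table = make_arrow_table(example["table"])

con = duckdb.connect()
con.register("example_table", arrow_table)
print(con.execute("SELECT * FROM example_table LIMIT 5").fetchall())
```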
In WikiSQL, the queries are formatted as a series of projection and filter expressions. Although there is a `human_readable` field, it's not valid SQL! Let's define a `codegen_query` function that turns it into executable SQL.
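Here's a minimal sketch of such a function, assuming the Hugging Face WikiSQL schema (column/aggregation/operator indices in `sel`, `agg`, and `conds`); the tutorial's real helper may handle quoting and types more carefully.

```python
AGG_OPS = ["", "MAX", "MIN", "COUNT", "SUM", "AVG"]
COND_OPS = ["=", ">", "<"]

def codegen_query(example, table_name="example_table"):
    # Translate WikiSQL's structured query into executable SQL.
    # Field names follow the Hugging Face schema; quoting here is simplistic.
    sql = example["sql"]
    header = example["table"]["header"]

    sel_col = '"' + header[sql["sel"]] + '"'
    agg = AGG_OPS[sql["agg"]]
    projection = f"{agg}({sel_col})" if agg else sel_col

    conds = sql["conds"]
    filters = []
    for col_idx, op_idx, value in zip(
        conds["column_index"], conds["operator_index"], conds["condition"]
    ):
        column = '"' + header[col_idx] + '"'
        escaped = str(value).replace("'", "''")
        filters.append(f"{column} {COND_OPS[op_idx]} '{escaped}'")

    where_clause = f" WHERE {' AND '.join(filters)}" if filters else ""
    return f"SELECT {projection} FROM {table_name}{where_clause}"
```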
Now, we can run this SQL directly.
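For example, continuing with the hypothetical `example_table` registered above:

```python
# Generate SQL for the example and execute it against the registered table.
sql = codegen_query(example)
print(sql)
print(con.execute(sql).fetchall())
```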
Running your first experiment
In this section, we'll create our first experiment and analyze the results in Braintrust.
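As a rough sketch of the text-to-SQL piece itself, here's what a minimal `text2sql` function might look like, assuming an OpenAI chat model; the model name, prompt wording, and client setup are placeholders rather than the tutorial's exact code.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def text2sql(question, table):
    # Ask the model for a DuckDB query over the example's table.
    # Prompt wording and model name are assumptions.
    columns = ", ".join(table["header"])
    prompt = f"""Write a DuckDB SQL query that answers the question below.
The table is named example_table and has these columns: {columns}.
Return only the SQL query, with no explanation.

Question: {question}"""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()

print(text2sql(example["question"], example["table"]))
```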
Exciting! Now that we've tested it out on an example, we can run an evaluation on a bigger dataset to understand how well the prompt works.
Running an eval
To run an eval, we simply need to stitch together the pieces we've already created into the `Eval()` function, which takes:

- The data you want to evaluate
- A `task` function that, given some input, returns an output
- One or more scoring functions that evaluate the output
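Here's roughly what the first two pieces might look like: a helper that turns a handful of WikiSQL rows into eval cases, and a `task` function that wraps `text2sql`. The case count and field names are assumptions; we'll pass these to `Eval()` once we've logged in and defined scorers below.

```python
def load_eval_data(n: int = 10):
    # Turn the first n WikiSQL rows into Braintrust-style eval cases:
    # the input is the question plus its table, and the expected output
    # is the ground-truth SQL produced by codegen_query.
    return [
        {
            "input": {"question": row["question"], "table": row["table"]},
            "expected": codegen_query(row),
        }
        for row in data.select(range(n))
    ]

def task(input):
    # The task just calls the text2sql function sketched earlier.
    return text2sql(input["question"], input["table"])
```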
Let's start by logging into Braintrust. You can technically skip this step if you've set `BRAINTRUST_API_KEY` in your environment.
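A minimal sketch of the login step, assuming the Python SDK's `braintrust.login()` picks up the API key from your environment:

```python
import braintrust

# Logs into Braintrust; with BRAINTRUST_API_KEY set, this should just succeed silently.
braintrust.login()
```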
Scoring functions
Next, we need to figure out how we'll score the outputs. One way is to compare the SQL query strings. This is not a perfect signal, because two different query strings can both return the correct result, but it is a useful measure of how far the generated query is from the ground truth.
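As a sketch of the string-comparison idea (a library scorer such as autoevals' Levenshtein would also work; the function name and signature here are assumptions):

```python
import difflib

def query_similarity(input, output, expected):
    # Rough 0-1 similarity between the generated SQL and the ground truth.
    # Braintrust passes input/output/expected to scorers by name.
    return difflib.SequenceMatcher(None, output or "", expected or "").ratio()
```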
A more robust way to test the queries is to run them on a database and compare the results. We'll use DuckDB for this, and define a scoring function that runs the generated SQL and compares its results to the ground truth's.
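Here's a sketch of that scorer (named `answer` so it lines up with the score referenced below), followed by the eval itself. The table registration, scorer signature, and project name are assumptions, and depending on your environment you may need to `await` the `Eval()` call.

```python
import duckdb
from braintrust import Eval

def answer(input, output, expected):
    # Execute both queries against this example's table in DuckDB and compare
    # the result sets; any error in the generated query counts as a 0.
    con = duckdb.connect()
    con.register("example_table", make_arrow_table(input["table"]))
    try:
        generated = con.execute(output).fetchall()
    except Exception:
        return 0.0
    reference = con.execute(expected).fetchall()
    return 1.0 if generated == reference else 0.0

Eval(
    "text2sql-tutorial",  # hypothetical project name
    data=load_eval_data,
    task=task,
    scores=[query_similarity, answer],
)
```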
Once the eval completes, you can click on the link to see the results in the Braintrust UI.
Take a look at the failures. Feel free to explore individual examples, filter down to low `answer` scores, etc. You should notice that `idx=8` is one of the failures. Let's debug it and see if we can improve the prompt.
Debugging a failure
We'll first set `idx=8` and reproduce the failure.
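A sketch of that setup, reusing the hypothetical helpers from above:

```python
idx = 8
example = data[idx]

# Register this example's table so we can execute queries against it.
con = duckdb.connect()
con.register("example_table", make_arrow_table(example["table"]))
```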
Here is the ground truth:
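In terms of the sketch above, that's the SQL produced by `codegen_query` for this row, which we can run directly:

```python
ground_truth_sql = codegen_query(example)
print(ground_truth_sql)
print(con.execute(ground_truth_sql).fetchall())
```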
And then what the model spits out:
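That is, the output of the hypothetical `text2sql` helper:

```python
model_sql = text2sql(example["question"], example["table"])
print(model_sql)
```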
Hmm, if only the model knew that `'ushl'` is actually capitalized in the data. Let's fix this by providing some sample data for each column:
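A sketch of an updated prompt that includes a few sample values per column; the sample count, prompt wording, and model name are assumptions.

```python
def text2sql_with_samples(question, table, num_samples=3):
    # Same idea as text2sql above, but the prompt now shows a few sample
    # values per column so the model can see formatting and capitalization.
    lines = []
    for i, name in enumerate(table["header"]):
        values = [row[i] for row in table["rows"][:num_samples]]
        lines.append(f"{name} (e.g. {values})")
    column_info = "\n".join(lines)
    prompt = f"""Write a DuckDB SQL query that answers the question below.
The table is named example_table and has these columns (with sample values):
{column_info}
Return only the SQL query, with no explanation.

Question: {question}"""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()

print(text2sql_with_samples(example["question"], example["table"]))
```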
Ok great! Now let's re-run the eval with this new version of the code.
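A sketch of the re-run, pointing the task at the new prompt so both experiments land in the same (hypothetical) project and can be compared:

```python
Eval(
    "text2sql-tutorial",
    data=load_eval_data,
    task=lambda input: text2sql_with_samples(input["question"], input["table"]),
    scores=[query_similarity, answer],
)
```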
Wrapping up
Congrats 🎉. You've run your first couple of experiments. Now, head back to the tutorial docs to proceed to the next step, where we'll analyze the experiments.