Coda's Help Desk with and without RAG
Welcome to Braintrust! In this notebook, you'll build and evaluate an AI app that answers questions about Coda's help desk.
To provide the LLM with relevant information from Coda's help desk, we'll use a technique called RAG (retrieval-augmented generation) to infuse our prompts with text from the most-relevant sections of their docs. To evaluate the performance of our app, we'll use an LLM to generate question-answer pairs from the docs, and we'll use a technique called model graded evaluation to automatically evaluate the final responses against the expected answers.
Before starting, please make sure that you have a Braintrust account. If you do not, please sign up or get in touch. After this tutorial, feel free to dig deeper by visiting the docs.
Download Markdown docs from Coda's help desk
Let's start by downloading the Coda docs and splitting them into their constituent Markdown sections.
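Concretely, the download-and-split step might look something like the sketch below. The `DOCS_URL` placeholder and the JSON shape of the scraped articles are assumptions for illustration, and the heading-based splitter is just one reasonable heuristic:

```python
import re

import requests

# Placeholder URL for a pre-scraped JSON dump of Coda help desk articles (an assumption).
DOCS_URL = "https://example.com/coda-help-center.json"

def get_markdown_sections(markdown: str) -> list[str]:
    """Split a Markdown document into sections, one per heading."""
    return [s.strip() for s in re.split(r"(?=^#+ )", markdown, flags=re.MULTILINE) if s.strip()]

# Assumes each article is a dict with a "markdown" field containing the page body.
articles = requests.get(DOCS_URL).json()
markdown_sections = [
    section for article in articles for section in get_markdown_sections(article["markdown"])
]
print(f"Downloaded {len(articles)} articles, split into {len(markdown_sections)} sections")
```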
Use the Braintrust AI proxy to access the OpenAI API
The Braintrust AI proxy provides a single API to access OpenAI and Anthropic models, LLaMa 2, Mistral, and others. Here we use it to access `gpt-3.5-turbo`. Because the Braintrust AI proxy automatically caches and reuses results (when `temperature=0` or the `seed` parameter is set, or when the caching mode is set to `always`), we can re-evaluate the following prompts many times without incurring additional API costs.

If you'd prefer not to use the proxy, simply omit the `base_url` and `default_headers` parameters below.
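For reference, here's a minimal sketch of the client setup, assuming the OpenAI Python SDK (v1) and a proxy caching header; double-check the proxy URL and header name against the current Braintrust docs:

```python
import os

from openai import OpenAI

client = OpenAI(
    base_url="https://api.braintrust.dev/v1/proxy",  # Braintrust AI proxy endpoint
    api_key=os.environ["OPENAI_API_KEY"],
    # Assumed caching header; verify the exact name and values in the proxy docs.
    default_headers={"x-bt-use-cache": "always"},
)

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    temperature=0,  # temperature=0 also makes responses eligible for caching
    messages=[{"role": "user", "content": "Say hello."}],
)
print(response.choices[0].message.content)
```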
Generate question-answer pairs
Before we start evaluating some prompts, let's first use the LLM to generate a bunch of question/answer pairs from the text at hand. We'll use these QA pairs as ground truth when grading our models later.
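Here's a hedged sketch of that generation step, reusing the `client` and `markdown_sections` names from the sketches above; the prompt wording, JSON output format, and `NUM_QA_PAIRS` value are illustrative assumptions:

```python
import json

NUM_QA_PAIRS = 20  # tunable constant: how many QA pairs to generate overall

def generate_qa_pairs(section: str, n: int = 2) -> list[dict]:
    """Ask the model for question/answer pairs grounded in a single doc section."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0,
        messages=[
            {
                "role": "user",
                "content": (
                    f"Generate {n} question/answer pairs that can be answered from the "
                    "help desk section below. Respond with a JSON list of objects with "
                    'keys "question" and "answer", and nothing else.\n\n'
                    f"{section}"
                ),
            }
        ],
    )
    return json.loads(response.choices[0].message.content)

qa_pairs = []
for section in markdown_sections:
    if len(qa_pairs) >= NUM_QA_PAIRS:
        break
    qa_pairs.extend(generate_qa_pairs(section))
qa_pairs = qa_pairs[:NUM_QA_PAIRS]
```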
Evaluate a context-free prompt (no RAG)
Now let's evaluate a simple prompt that poses each question without providing any context from the Markdown docs. We'll evaluate this naive approach using `gpt-3.5-turbo` (again), with the Factuality prompt from the Braintrust autoevals library.
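Roughly, that eval could look like the following with the Braintrust Python SDK and autoevals, reusing `client` and `qa_pairs` from the sketches above; the project name and prompt are placeholders:

```python
from autoevals import Factuality
from braintrust import Eval

def answer_no_context(question: str) -> str:
    """Baseline task: answer the question with no retrieved documentation."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0,
        messages=[{"role": "user", "content": f"Answer this question about Coda: {question}"}],
    )
    return response.choices[0].message.content

# Depending on your environment (e.g. inside a notebook), this call may need to be awaited.
Eval(
    "Coda Help Desk (tutorial)",  # placeholder project name
    data=[{"input": qa["question"], "expected": qa["answer"]} for qa in qa_pairs],
    task=answer_no_context,
    scores=[Factuality],  # model-graded comparison of the output against the expected answer
)
```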
Pause and click into the experiment in Braintrust!
The cell above will print a link to a Braintrust experiment -- click on it to view our baseline eval.
Try using RAG to improve performance
Let's see if RAG (retrieval-augmented generation) can improve our results on this task.
First, we'll compute embeddings for each Markdown section using `text-embedding-ada-002` and create an index over the embeddings in LanceDB, a vector database. Then, for any given query, we can convert it to an embedding, efficiently find the best-matching sections in embedding space, and provide the corresponding text as additional context in our prompt.
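A sketch of the indexing and retrieval step with the LanceDB Python client, again reusing `client` and `markdown_sections`; the single-call embedding batch and the table schema are simplifying assumptions:

```python
import lancedb

def embed(texts: list[str]) -> list[list[float]]:
    """Embed a batch of texts with text-embedding-ada-002 (also cached by the proxy)."""
    response = client.embeddings.create(model="text-embedding-ada-002", input=texts)
    return [item.embedding for item in response.data]

db = lancedb.connect("./lancedb")  # local, file-backed LanceDB database
table = db.create_table(
    "coda-sections",
    data=[
        # In practice you'd batch the embedding calls; one call here for brevity.
        {"text": section, "vector": vector}
        for section, vector in zip(markdown_sections, embed(markdown_sections))
    ],
)

def retrieve(query: str, k: int) -> list[str]:
    """Return the k sections whose embeddings are closest to the query's embedding."""
    [query_vector] = embed([query])
    return [row["text"] for row in table.search(query_vector).limit(k).to_list()]
```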
Use AI to judge relevance of retrieved documents
We're almost there! One more trick -- let's actually retrieve a few more of the best-matching candidates from the vector database than we intend to use, then use `gpt-3.5-turbo` to score the relevance of each candidate to the input query. We'll use the `TOP_K` blurbs by relevance score in our QA prompt -- this should be a little more intelligent than just using the closest embeddings.
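One way to sketch that re-ranking step is shown below; the `NUM_CANDIDATES` value and the 0-10 scoring prompt are illustrative assumptions, and `retrieve` comes from the LanceDB sketch above:

```python
TOP_K = 2            # sections actually included in the QA prompt
NUM_CANDIDATES = 10  # retrieve more candidates than we keep, then re-rank them

def relevance_score(query: str, document: str) -> float:
    """Ask the model to rate how relevant a retrieved section is to the query (0-10)."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0,
        messages=[
            {
                "role": "user",
                "content": (
                    "On a scale of 0 to 10, how relevant is the document below to the "
                    f"question? Reply with a single number.\n\nQuestion: {query}\n\n"
                    f"Document: {document}"
                ),
            }
        ],
    )
    try:
        return float(response.choices[0].message.content.strip())
    except ValueError:
        return 0.0  # treat unparseable replies as irrelevant

def retrieve_reranked(query: str) -> list[str]:
    """Fetch NUM_CANDIDATES sections by embedding distance and keep the TOP_K most relevant."""
    candidates = retrieve(query, NUM_CANDIDATES)
    return sorted(candidates, key=lambda doc: relevance_score(query, doc), reverse=True)[:TOP_K]
```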
Run the RAG evaluation
Summary
Click into the new experiment and check it out. You should notice a few things:
- Braintrust will automatically compare the new experiment to your previous one.
- You should see an increase in scores with RAG. Click around to see exactly which examples improved.
- Try playing around with the constants set at the beginning of this tutorial, such as `NUM_QA_PAIRS`, to evaluate on a larger dataset.
We hope you had fun with this tutorial! You can learn more about Braintrust at https://www.braintrust.dev/docs.