Classifying news articles
This is a quick tutorial on how to build and evaluate an AI app to classify news titles into categories with Braintrust.
Before starting, make sure that you have a Braintrust account. If you do not, please sign up first. After this tutorial, learn more by visiting the docs.
First, we'll install some dependencies.
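For example, the libraries used in the rest of this tutorial (inferred from the steps below) can be installed with pip:

```python
# Run this in a notebook cell (or drop the "%" and run it in a shell).
%pip install -U braintrust autoevals openai datasets
```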
Next, we'll import the ag_news dataset from Hugging Face.
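Here's a minimal sketch using the Hugging Face `datasets` library. It loads a small slice of the training split and maps the integer labels to their category names; the `text` and `label` field names come from the public ag_news schema:

```python
from datasets import load_dataset

# Load a small slice of the training split to keep the eval fast.
dataset = load_dataset("ag_news", split="train[:100]")

# ag_news stores labels as integers; look up their string names
# ("World", "Sports", "Business", "Sci/Tech").
category_names = dataset.features["label"].names

# Shape each record as {"input": ..., "expected": ...} for use in the eval later.
articles = [
    {"input": row["text"], "expected": category_names[row["label"]]}
    for row in dataset
]
```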
Writing the initial prompts
Let's first write a prompt for categorizing the title of a single article. With Braintrust, you can use any library you'd like: OpenAI, OSS models, LangChain, Guidance, or even just direct calls to an LLM.
The prompt provides the article's title to the model, and asks it to generate a category.
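A minimal version of that prompt might look like the sketch below. The exact wording is illustrative, and note the `Sci-Tech` spelling, which will matter later:

```python
def classify_prompt(title: str):
    # Ask the model to respond with exactly one category name.
    return [
        {
            "role": "system",
            "content": (
                "You are an editor that classifies news titles into one of the "
                "following categories: World, Sports, Business, Sci-Tech. "
                "Respond with only the category name."
            ),
        },
        {"role": "user", "content": f"Title: {title}"},
    ]
```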
Next, let's initialize an OpenAI client with your API key. We'll use `wrap_openai` from the `braintrust` library to automatically instrument the client to track useful metrics for you. When Braintrust is not initialized, `wrap_openai` is a no-op.
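Here's a sketch of that setup (reading the key from the `OPENAI_API_KEY` environment variable is an assumption):

```python
import os

import braintrust
from openai import OpenAI

# wrap_openai instruments the client so token usage, latency, and other metrics
# are recorded whenever Braintrust is active; otherwise it does nothing.
client = braintrust.wrap_openai(OpenAI(api_key=os.environ["OPENAI_API_KEY"]))
```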
It's time to try writing a prompt and classifying a title! We'll define a `classify_article` function that takes an input title and returns a category. The `@braintrust.traced` decorator, like `wrap_openai` above, will help us trace inputs, outputs, and timing, and is a no-op when Braintrust is not active.
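Putting it together, `classify_article` might look like the sketch below (the model name and decoding settings are illustrative):

```python
@braintrust.traced
def classify_article(title: str) -> str:
    # Call the model with the prompt defined above and return its category guess.
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=classify_prompt(title),
        temperature=0,
        max_tokens=10,
    )
    return response.choices[0].message.content.strip()

# Try it on a single article.
print(classify_article(articles[0]["input"]))
```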
Running across the dataset
Now that we have automated classifying titles, we can test the full set of articles using Braintrust's `Eval` function. Behind the scenes, `Eval` will run the `classify_article` function on each article in the dataset in parallel, and then compare the results to the ground-truth labels using a simple `Levenshtein` scorer. When it finishes running, it will print out the results with a link to Braintrust to dig deeper.
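A sketch of that eval, using the `articles` list built earlier and the `Levenshtein` scorer from autoevals (the project name is an assumption):

```python
from braintrust import Eval
from autoevals import Levenshtein

# In a notebook (which has a running event loop), Eval should be awaited;
# in a plain script you can drop the await.
await Eval(
    "Classifying News Articles",  # project name (assumed)
    data=articles,                # records with "input" and "expected" fields
    task=classify_article,        # runs on each input, in parallel
    scores=[Levenshtein],         # compares the output to the expected label
)
```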
Pause and analyze the results in Braintrust!
The cell above will print a link to the Braintrust experiment. Click on it to investigate where we can improve our AI app.
Looking at our results table (in the screenshot below), we incorrectly output `Sci-Tech` instead of `Sci/Tech`, which results in a failed eval test case. Let's fix it.
Reproducing an example
First, let's see if we can reproduce this issue locally. We can test an article corresponding to the `Sci/Tech` category and reproduce the evaluation:
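For example, we can pull a Sci/Tech article from our dataset and run `classify_article` on it directly:

```python
# Grab an article whose expected label is "Sci/Tech".
test_article = next(a for a in articles if a["expected"] == "Sci/Tech")

print("Expected:", test_article["expected"])
print("Output:  ", classify_article(test_article["input"]))
# This should reproduce the mismatch described above: the model answers
# "Sci-Tech" while the dataset's label is "Sci/Tech".
```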
Fixing the prompt
Have you spotted the issue? It looks like we misspelled one of the categories in our prompt. The dataset's categories are `World`, `Sports`, `Business`, and `Sci/Tech`, but we are using `Sci-Tech` in our prompt. Let's fix it:
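The fix is a one-word change to the category list in the prompt:

```python
def classify_prompt(title: str):
    # Use the dataset's exact category names, including "Sci/Tech".
    return [
        {
            "role": "system",
            "content": (
                "You are an editor that classifies news titles into one of the "
                "following categories: World, Sports, Business, Sci/Tech. "
                "Respond with only the category name."
            ),
        },
        {"role": "user", "content": f"Title: {title}"},
    ]

# Re-check the failing example; it should now return "Sci/Tech".
print(classify_article(test_article["input"]))
```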
Evaluate the new prompt
The model classified the correct category `Sci/Tech` for this example. But how do we know it works for the rest of the dataset? Let's run a new experiment to evaluate our new prompt using Braintrust.
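Re-running the same `Eval` call creates a new experiment in the same project, which Braintrust compares against the previous run:

```python
await Eval(
    "Classifying News Articles",
    data=articles,
    task=classify_article,
    scores=[Levenshtein],
)
```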
Conclusion
Click into the new experiment, and check it out! You should notice a few things:
- Braintrust will automatically compare the new experiment to your previous one.
- You should see the eval scores increase and you can see which test cases improved.
- You can also filter the test cases that have a low score and work on improving the prompt for those.
Now you're well on your way to building reliable AI apps with Braintrust!