Evaluating a multi-turn chat assistant
This tutorial will walk through using Braintrust to evaluate a conversational, multi-turn chat assistant.
These types of chat bots have become important parts of applications, acting as customer service agents, sales representatives, or travel agents, to name a few. As an owner of such an application, it's important to be sure the bot provides value to the user.
We will expand on this below, but the history and context of a conversation are crucial to producing a good response. If you received a request to "Make a dinner reservation at 7pm" and you knew where, on what date, and for how many people, you could provide some assistance; otherwise, you'd need to ask for more information.
Before starting, please make sure you have a Braintrust account. If you do not have one, you can sign up here.
Installing dependencies
Begin by installing the necessary dependencies if you have not done so already.
Inspecting the data
Let's take a look at the small dataset prepared for this cookbook. You can find the full dataset in the accompanying dataset.ts file. The assistant turns were generated using claude-3-5-sonnet-20240620.
Below is an example of a data point.
chat_history contains the history of the conversation between the user and the assistant
input is the last user turn that will be sent in the messages argument to the chat completion
expected is the output expected from the chat completion given the input
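As a rough sketch, the three fields described above could be typed as follows. The type names (ChatMessage, DataPoint) and the placeholder strings are assumptions for illustration and are not rows from dataset.ts; only the field names and the input question come from this cookbook.

```typescript
// Hypothetical types: the field names follow the text, the type names are made up.
interface ChatMessage {
  role: "user" | "assistant";
  content: string;
}

interface DataPoint {
  chat_history: ChatMessage[]; // earlier turns between the user and the assistant
  input: string; // the last user turn
  expected: string; // the output we expect from the chat completion
}

// Placeholder content only; the real rows live in dataset.ts.
const example: DataPoint = {
  chat_history: [
    { role: "user", content: "<earlier user question>" },
    { role: "assistant", content: "<earlier assistant answer>" },
  ],
  input: "Who won the men's trophy that year?",
  expected: "<expected answer>",
};
```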
From looking at this one example, we can see why the history is necessary to provide a helpful response.
If you were asked "Who won the men's trophy that year?" you would wonder: What trophy? Which year? But if you were also given the chat_history, you would be able to answer the question (maybe after some quick research).
Running experiments
The key to running evals on a multi-turn conversation is to include the history of the chat in the chat completion request.
Assistant with no chat history
To start, let's see how the prompt performs when no chat history is provided. We'll create a simple task function that returns the output from a chat completion.
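A minimal sketch of such a task function is shown here. To keep it self-contained, the chat completion call is represented by a hypothetical Complete function type rather than a real client, and the system prompt wording is an assumption.

```typescript
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

// Hypothetical stand-in for a chat completion call: messages in, reply text out.
type Complete = (messages: ChatMessage[]) => Promise<string>;

// Assumed wording; the cookbook's actual system prompt is not shown here.
const SYSTEM_PROMPT = "You are a helpful assistant.";

// With no chat history, the request holds only the system prompt and the last user turn.
function buildMessages(input: string): ChatMessage[] {
  return [
    { role: "system", content: SYSTEM_PROMPT },
    { role: "user", content: input },
  ];
}

// The task function simply returns the output of the chat completion.
function makeTask(complete: Complete) {
  return (input: string): Promise<string> => complete(buildMessages(input));
}
```

In a real run, the Complete argument would wrap whichever chat completion client you use.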
Scoring and running the eval
We'll use the Factuality scoring function from the autoevals library to check how the output of the chat completion compares factually to the expected value.
We will also utilize trials by including the trialCount parameter in the Eval call. We expect the output of the chat completion to be non-deterministic, so running each input multiple times will give us a better sense of the "average" output.
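To illustrate why trials help, the sketch below averages several scores per input. It is plain arithmetic and not Braintrust's own aggregation; the TrialResult type, the function name, and the sample scores are all hypothetical.

```typescript
// One entry per trial: which input it belongs to and the score it received.
interface TrialResult {
  input: string;
  score: number;
}

// Average the scores of all trials that share the same input.
function meanScoreByInput(results: TrialResult[]): Record<string, number> {
  const totals: Record<string, { total: number; count: number }> = {};
  for (const { input, score } of results) {
    const entry = totals[input] ?? { total: 0, count: 0 };
    totals[input] = { total: entry.total + score, count: entry.count + 1 };
  }
  const means: Record<string, number> = {};
  for (const input of Object.keys(totals)) {
    const entry = totals[input];
    if (entry) means[input] = entry.total / entry.count;
  }
  return means;
}

// Made-up scores: three trials for each of two inputs.
const sampleMeans = meanScoreByInput([
  { input: "q1", score: 1 },
  { input: "q1", score: 1 },
  { input: "q1", score: 0 },
  { input: "q2", score: 0 },
  { input: "q2", score: 0 },
  { input: "q2", score: 0 },
]);
```

A single trial of q1 could have reported either 1 or 0; the mean over three trials gives a steadier picture.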
A 61.33% Factuality score? Given what we discussed earlier about chat history being important in producing a good response, that's surprisingly high. Let's log in to braintrust.dev and take a look at how we got that score.
Interpreting the results
If we look at the score distribution chart, we can see ten of the fifteen examples scored at least 60%, with over half even scoring 100%. If we look into one of the examples with 100% score, we see the output of the chat completion request is asking for more context as we would expect:
Could you please specify which athlete or player you're referring to? There are many professional athletes, and I'll need a bit more information to provide an accurate answer.
This aligns with our expectation, so let's now look at how the score was determined.
Clicking into the scoring trace, we see the chain-of-thought reasoning used to settle on the score. The model chose (E) The answers differ, but these differences don't matter from the perspective of factuality. That choice is technically correct, but we want to penalize the chat completion for not being able to produce a good response.
Improve scoring with a custom scorer
While Factuality is a good general purpose scorer, for our use case option (E) is not well aligned with our expectations. The best way to work around this is to customize the scoring function so that it produces a lower score for asking for more context or specificity.
You can see the built-in Factuality prompt here. For our customized scorer, we've added two score choices to that prompt:
These will score (F) = 0.2 and (G) = 0 so the model gets some credit if there was any context it was able to gather from the user's input.
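The effect of the two extra choices can be sketched as a choice-to-score lookup. The values for (A) through (E) are my recollection of the built-in Factuality scorer and should be treated as an assumption to check against the linked prompt; only (F) = 0.2 and (G) = 0 come from this cookbook. The comments paraphrase the choices and are not the prompt's wording.

```typescript
// Assumed values for the built-in choices; verify against the Factuality prompt.
const factualityChoiceScores: Record<string, number> = {
  A: 0.4, // submission is a subset of the expert answer
  B: 0.6, // submission is a superset of the expert answer
  C: 1, // same details as the expert answer
  D: 0, // submission disagrees with the expert answer
  E: 1, // differences do not matter for factuality
};

// The custom scorer adds two choices for answers that ask for more context.
const customChoiceScores: Record<string, number> = {
  ...factualityChoiceScores,
  F: 0.2, // asks for more context but still gets some credit
  G: 0, // asks for more context and gets no credit
};

// Look up the numeric score for the choice the grading model selected.
function scoreForChoice(
  choice: string,
  scores: Record<string, number> = customChoiceScores,
): number {
  const score = scores[choice];
  if (score === undefined) throw new Error(`Unknown choice: ${choice}`);
  return score;
}
```

With this mapping, a response that only asks for clarification no longer lands on (E) and a full score.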
We can then use this spec and the LLMClassifierFromSpec function to create our custom scorer to use in the eval function.
Read more about defining your own scorers in the documentation.
Re-running the eval
Let's now use this updated scorer and run the experiment again.
6.67% as a score aligns much better with what we expected. Let's look again into the results of this experiment.
Interpreting the results
In the table we can see the output fields in which the chat completion responses are requesting more context. In one of the examples that had a non-zero score, we can see that the model asked for some clarification, but was able to understand from the question that the user was inquiring about a controversial World Series. Nice!
Looking into how the score was determined, we can see that the factual information aligned with the expert answer but the submitted answer still asks for more context, resulting in a score of 20% which is what we expect.
Assistant with chat history
Now let's shift and see how providing the chat history improves the experiment.
Update the data, task function and scorer function
We need to edit the inputs to the Eval function so we can pass the chat history to the chat completion request.
We update the parameter to the task function to accept both the input string and the chat_history array and add the chat_history into the messages array in the chat completion request, done here using the spread ... syntax.
We also need to update the experimentData and Factual function parameters to align with these changes.
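The change to the messages array can be sketched as below. The TaskInput type name and the system prompt wording are assumptions; the input and chat_history field names and the use of the spread syntax follow the description above.

```typescript
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

// The task input is now an object rather than a plain string.
interface TaskInput {
  input: string;
  chat_history: ChatMessage[];
}

// Assumed wording; the cookbook's actual system prompt is not shown here.
const SYSTEM_PROMPT = "You are a helpful assistant.";

// Spread the earlier turns between the system prompt and the last user turn.
function buildMessages({ input, chat_history }: TaskInput): ChatMessage[] {
  return [
    { role: "system", content: SYSTEM_PROMPT },
    ...chat_history,
    { role: "user", content: input },
  ];
}
```

Keeping the history in its original order matters, since the model reads the turns as a running conversation that ends with the newest user message.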
Running the eval
Use the updated variables and functions to run a new eval.
A 60% score is a definite improvement over the 6.67% from the run without chat history.
You'll notice that it says there were 0 improvements and 0 regressions compared to the last experiment we ran, gpt-4o assistant - no history-934e5ca2. This is because by default, Braintrust uses the input field to match rows across experiments. From the dashboard, we can customize the comparison key (see docs) by going to the project configuration page.
Update experiment comparison for diff mode
Let's go back to the dashboard.
For this cookbook, we can use the expected field as the comparison key because this field is unique in our small dataset.
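To see why the choice of key matters, here is a small sketch of matching rows across two experiments. It is an illustration only and not how Braintrust implements comparisons; the Row type, the matchRows helper, and the sample rows and scores are hypothetical.

```typescript
interface Row {
  input: unknown; // a string in the first experiment, an object in the second
  expected: string;
  score: number;
}

interface MatchedPair {
  before: Row;
  after: Row;
}

// Pair up rows from two experiments that share the same comparison key.
function matchRows(before: Row[], after: Row[], key: (row: Row) => string): MatchedPair[] {
  const index: Record<string, Row> = {};
  for (const row of before) index[key(row)] = row;
  const pairs: MatchedPair[] = [];
  for (const row of after) {
    const match = index[key(row)];
    if (match) pairs.push({ before: match, after: row });
  }
  return pairs;
}

const byInput = (row: Row): string => JSON.stringify(row.input);
const byExpected = (row: Row): string => row.expected;

// Made-up rows: same question, but the input changed shape between experiments.
const question = "Who won the men's trophy that year?";
const noHistory: Row[] = [{ input: question, expected: "<expected answer>", score: 0 }];
const withHistory: Row[] = [
  { input: { input: question, chat_history: [] }, expected: "<expected answer>", score: 0.6 },
];
```

Matching with byInput finds no pairs here because a string never equals the new object, while byExpected pairs the rows, which is what lets diff mode report improvements.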
In the Configuration tab, go to the bottom of the page to update the comparison key:
Interpreting the results
Turn on diff mode using the toggle on the upper right of the table.
Since we updated the comparison key, we can now see the improvements in the Factuality score between the experiment run with chat history and the most recent one run without for each of the examples. If we also click into a trace, we can see the change in input parameters that we made above where it went from a string to an object with input and chat_history fields.
All of our rows scored 60% in this experiment. If we look into each trace, this means the submitted answer includes all the details from the expert answer with some additional information.
60% is an improvement from the previous run, but we can do better. Since it seems like the chat completion is always returning more than necessary, let's see if we can tweak our prompt to have the output be more concise.
Improving the result
Let's update the system prompt used in the chat completion request.
In the task function, we'll update the system
message to specify the output should be precise and then run the eval again.
Let's go into the dashboard and see the new experiment.
Success! We got a 27 percentage point increase in factuality, up to an average score of 87% for this experiment with our updated prompt.
Conclusion
We've seen in this cookbook how to evaluate a chat assistant and visualize how the chat history affects the output of the chat completion. Along the way, we also used some other functionality, such as updating the comparison key in the diff view and creating a custom scoring function.
Try seeing how you can improve the outputs and scores even further!