Tool calls in LLaMa 3.1
LLaMa 3.1 is distributed as an instruction-tuned model in 8B, 70B, and 405B parameter variants. As part of the release, Meta mentioned:

> These are multilingual and have a significantly longer context length of 128K, state-of-the-art tool use, and overall stronger reasoning capabilities.
Let's dig into how we can use these models with tools, and run an eval to see how they compare to gpt-4o on a benchmark.
Setup
You can access LLaMa 3.1 models through inference services like Together, which has generous rate limits and OpenAI protocol compatibility. We'll use Together, through the Braintrust proxy, to access both LLaMa 3.1 and OpenAI models.
To get started, make sure you have a Braintrust account and API keys for Together and OpenAI. Plug them into your Braintrust account's
AI secrets configuration and acquire a BRAINTRUST_API_KEY. Feel free to put your BRAINTRUST_API_KEY in a .env.local
file next to this notebook, or just hardcode it into the code below.
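Here's a minimal sketch of that setup, assuming the proxy's OpenAI-compatible endpoint at `https://api.braintrust.dev/v1/proxy` and Together's model id for the 8B variant (double-check both against your configuration). To see why tools matter, we'll start by asking a question that needs live data:

```python
import os

from dotenv import load_dotenv
from openai import OpenAI

load_dotenv(".env.local")  # pick up BRAINTRUST_API_KEY if you keep it in .env.local

# The Braintrust proxy speaks the OpenAI protocol, so one client works for both
# the Together-hosted LLaMa 3.1 models and OpenAI models.
client = OpenAI(
    base_url="https://api.braintrust.dev/v1/proxy",
    api_key=os.environ["BRAINTRUST_API_KEY"],
)

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo",  # Together's 8B model id (assumption)
    messages=[
        {"role": "user", "content": "What is the weather in San Francisco right now?"}
    ],
)
print(response.choices[0].message.content)
```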
As expected, the model can't answer the question without access to some tools. Traditionally, LLaMa models haven't supported tool calling. Some inference providers have attempted to solve this with controlled generation or similar methods, though with limited success. However, the LLaMa 3.1 documentation alludes to a new approach to tool calls:
Let's see if we can make this work with the commonly used weather tool definition.
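Here's a sketch of that definition (the standard `get_current_weather` example from the OpenAI docs), passed via the OpenAI-style `tools` parameter. Depending on the provider, the call may come back as plain text in the message content rather than a structured `tool_calls` field:

```python
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_current_weather",
            "description": "Get the current weather in a given location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "The city and state, e.g. San Francisco, CA",
                    },
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
                },
                "required": ["location"],
            },
        },
    }
]

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo",
    messages=[
        {"role": "user", "content": "What is the weather in San Francisco right now?"}
    ],
    tools=tools,
)
# Print the raw message so we can see exactly how the model expresses the tool call.
print(response.choices[0].message)
```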
Wow cool! Looks like we can get the model to call the tool. Let's quickly write a parser that can extract the function call from the response.
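Here's a sketch of such a parser, assuming the `<function=name>{json args}</function>` format that Meta's prompt-format documentation describes for custom tools (if your provider returns structured `tool_calls` instead, you won't need this):

```python
import json
import re

# Matches e.g. <function=get_current_weather>{"location": "San Francisco, CA"}</function>
FUNCTION_CALL_RE = re.compile(r"<function=(\w+)>(.*?)</function>", re.DOTALL)


def parse_function_call(content: str):
    """Extract a {name, arguments} dict from a LLaMa 3.1 tool-call response.

    Returns None if the response doesn't contain a well-formed call.
    """
    match = FUNCTION_CALL_RE.search(content or "")
    if not match:
        return None
    name, args = match.groups()
    try:
        return {"name": name, "arguments": json.loads(args)}
    except json.JSONDecodeError:
        return None


print(parse_function_call(response.choices[0].message.content))
```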
A real use case: LLM-as-a-Judge evaluators that make tool calls
At Braintrust, we maintain a suite of evaluator functions in the Autoevals library. Many of these evaluators, like Factuality, are "LLM-as-a-Judge" evaluators that use a well-crafted prompt to an LLM to reason about the quality of a response. We are big fans of tool calling, and leverage it extensively in autoevals to make it easy and reliable to parse the scores and reasoning they produce.
As we change autoevals, we run evals to make sure we improve performance and avoid regressing key scenarios. We'll run some of those evals here as a way of assessing how well LLaMa 3.1 stacks up to gpt-4o.
Here is a quick example of the Factuality scorer, a popular LLM-as-a-Judge evaluator that uses the following prompt:
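(Reproduced approximately from the autoevals source; see the library for the canonical template and choice-to-score mapping.)

```
You are comparing a submitted answer to an expert answer on a given question. Here is the data:
[BEGIN DATA]
************
[Question]: {{input}}
************
[Expert]: {{expected}}
************
[Submission]: {{output}}
************
[END DATA]

Compare the factual content of the submitted answer with the expert answer. Ignore any differences in style, grammar, or punctuation.
The submitted answer may either be a subset or superset of the expert answer, or it may conflict with it. Determine which case applies. Answer the question by selecting one of the following options:
(A) The submitted answer is a subset of the expert answer and is fully consistent with it.
(B) The submitted answer is a superset of the expert answer and is fully consistent with it.
(C) The submitted answer contains all the same details as the expert answer.
(D) There is a disagreement between the submitted answer and the expert answer.
(E) The answers differ, but these differences don't matter from the perspective of factuality.
```

When run with gpt-4o, the judge responds with a tool call that records a lettered choice and its reasoning, which autoevals maps to a numeric score.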
Now let's reproduce this with LLaMa 3.1.
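Here's a rough sketch of what that looks like, using a toy example, an abridged version of the prompt above, and a hypothetical `select_choice` tool (the exact tool autoevals defines may differ), together with the `client` and `parse_function_call` helpers from earlier:

```python
# An abridged grading prompt for one toy example.
grading_prompt = """You are comparing a submitted answer to an expert answer on a given question.

[Question]: Which country is Paris in?
[Expert]: France
[Submission]: Paris is the capital of France.

Compare the factual content of the submitted answer with the expert answer and select
one of: (A) subset, (B) superset, (C) same details, (D) disagreement, (E) differences
that don't matter for factuality."""

# Hypothetical tool for recording the judge's decision.
select_choice_tool = {
    "type": "function",
    "function": {
        "name": "select_choice",
        "description": "Record the selected choice and the reasoning behind it",
        "parameters": {
            "type": "object",
            "properties": {
                "reasons": {"type": "string", "description": "Step-by-step reasoning"},
                "choice": {"type": "string", "enum": ["A", "B", "C", "D", "E"]},
            },
            "required": ["reasons", "choice"],
        },
    },
}

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo",
    messages=[{"role": "user", "content": grading_prompt}],
    tools=[select_choice_tool],
)
print(parse_function_call(response.choices[0].message.content))
```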
Ok interesting! It parses, but the response is a little different from the GPT-4o response. Let's put this to the test at scale with some evals.
Running evals
We use a subset of the CoQA dataset to test the Factuality scorer. Let's load the dataset and take a look at an example.
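Here's a sketch of loading a small slice, assuming the Hugging Face `stanfordnlp/coqa` dataset and its story/questions/answers fields (the exact subset used here is an assumption):

```python
from datasets import load_dataset

# Take a small slice of the validation split; each record has a passage ("story"),
# a list of questions, and the corresponding reference answers.
coqa = load_dataset("stanfordnlp/coqa", split="validation[:20]")

example = coqa[0]
print(example["story"][:300])
print(example["questions"][0])
print(example["answers"]["input_text"][0])
```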
Not bad!
GPT-4o
Let's run a full eval with gpt-4o, LLaMa-3.1-8B, LLaMa-3.1-70B, and LLaMa-3.1-405B to see how they stack up. Since the evaluator generates a number between 0 and 1, we'll use the NumericDiff scorer to assess accuracy, and a custom NonNull scorer to measure how many invalid tool calls are generated.
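Here's a rough sketch of what the harness could look like with Braintrust's `Eval` API, reusing the `client`, `select_choice_tool`, and `parse_function_call` pieces from above. `render_factuality_prompt` and `load_factuality_examples` are hypothetical helpers, the project name is arbitrary, and the choice-to-score mapping should be double-checked against autoevals:

```python
import json

from braintrust import Eval
from autoevals import NumericDiff

# Mirrors autoevals' Factuality choice-to-score mapping (verify against the library).
CHOICE_SCORES = {"A": 0.4, "B": 0.6, "C": 1, "D": 0, "E": 1}


def non_null(output, **kwargs):
    """1 if the judge produced a parseable tool call (and hence a score), 0 otherwise."""
    return 1 if output is not None else 0


def factuality_judge(model):
    def task(input):
        # `render_factuality_prompt` is a hypothetical helper that fills the Factuality
        # template with one example's question, submission, and expected answer.
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": render_factuality_prompt(input)}],
            tools=[select_choice_tool],
        )
        message = response.choices[0].message
        if message.tool_calls:  # structured tool call (e.g. gpt-4o)
            args = json.loads(message.tool_calls[0].function.arguments)
        else:  # LLaMa 3.1-style tool call embedded in the text
            call = parse_function_call(message.content)
            if call is None:
                return None
            args = call["arguments"]
        return CHOICE_SCORES.get(args.get("choice"))

    return task


Eval(
    "LLaMa-3.1-Factuality",  # arbitrary project name
    data=load_factuality_examples,  # hypothetical loader over the CoQA subset above
    task=factuality_judge("gpt-4o"),
    scores=[NumericDiff, non_null],
)
```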
It looks like GPT-4o does pretty well. Tool calling has been a highlight of OpenAI's feature set for a while, so it's not surprising that 100% of its tool calls parse successfully.
LLaMa-3.1-8B, 70B, and 405B
Now let's evaluate each of the LLaMa-3.1 models.
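Here's a sketch of the same eval looped over the LLaMa 3.1 variants, assuming Together's model ids and that your braintrust SDK version supports the `experiment_name` argument:

```python
llama_models = [
    # Together model ids (assumptions -- check your provider's catalog)
    "meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo",
    "meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo",
    "meta-llama/Meta-Llama-3.1-405B-Instruct-Turbo",
]

for model in llama_models:
    Eval(
        "LLaMa-3.1-Factuality",
        data=load_factuality_examples,
        task=factuality_judge(model),
        scores=[NumericDiff, non_null],
        experiment_name=model,  # one experiment per model, so we can compare them side by side
    )
```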
Analyzing the results: LLaMa-3.1-8B
Ok, let's dig into the results. To start, we'll look at how LLaMa-3.1-8B compares to GPT-4o.
Although it's a fraction of the cost, it's both slower (likely due to rate limits) and performs worse than GPT-4o: 12 of the 60 cases failed to parse. Let's take a look at one of those in depth.
That definitely looks like an invalid tool call. Maybe we can experiment with tweaking the prompt to get better results.
Analyzing all models
If we look across models, we'll start to see some interesting takeaways.
- LLaMa-3.1-70B has no parsing errors, which is better than LLaMa-3.1-405B!
- Both LLaMa-3.1-70B and LLaMa-3.1-405B performed better than GPT-4o, although by a fairly small margin.
- LLaMa-3.1-70B is less than 25% of the cost of GPT-4o, and is actually a bit better.
Where to go from here
In just a few minutes, we've cracked the code on how to perform tool calls with LLaMa-3.1 models and run a benchmark to compare their performance to GPT-4o. In doing so, we've found a few specific areas for improvement, like parsing errors for tool calls, and a surprising outcome: LLaMa-3.1-70B performs better than both LLaMa-3.1-405B and GPT-4o at a fraction of the cost.
To explore this further, you could:
- Expand the benchmark to measure other kinds of evaluators.
- Try providing few-shot examples or fine-tuning the models to improve their performance.
- Play with other models, like GPT-4o-mini or Claude, to see how they compare.
Happy evaluating!