Benchmarking inference providers
Although there are only a handful of open-source LLMs, a variety of inference providers can host them for you, each with different trade-offs in cost, speed, and, as we'll see below, accuracy. And even if one provider excels at a certain model size, it may not be the best choice for another.
Key takeaways
It's very important to evaluate your specific use case against a variety of both models and providers to make an informed decision about which to use. What I learned is that the results are pretty unpredictable and vary across both provider and model size. Just because one provider has a good 8b model doesn't mean that its 405b is fast or accurate.
Here are some things that surprised me:
- 8b models are consistently fast, but have high variance in accuracy
- One provider is fastest for 8b and 70b, yet slowest for 405b
- The best provider is different across the two benchmarks we ran
Hopefully this analysis will help you create your own benchmarks and make an informed decision about which provider to use.
Setup
Before you get started, make sure you have a Braintrust account and API keys for all the providers you want to test. Here, we're testing Together, Fireworks, and Lepton, although Braintrust supports several others (including Azure, Bedrock, Groq, and more).
Make sure to plug each provider's API key into your Braintrust account's AI secrets configuration and acquire a BRAINTRUST_API_KEY.
Put your BRAINTRUST_API_KEY in a .env.local file next to this notebook, or just hardcode it into the code below.
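As a minimal sketch of the .env.local step, the snippet below reads simple KEY=VALUE lines using only the Python standard library. The helper name load_env_file is hypothetical, and the parser assumes a plain file with one assignment per line; a dedicated dotenv library would handle more edge cases.

```python
import os
from pathlib import Path


def load_env_file(path: str = ".env.local") -> dict:
    """Parse simple KEY=VALUE lines and copy them into os.environ.

    Hypothetical helper: assumes one assignment per line, with optional
    surrounding quotes. Blank lines, comments, and malformed lines are skipped.
    """
    loaded = {}
    env_path = Path(path)
    if not env_path.exists():
        return loaded
    for raw in env_path.read_text().splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        value = value.strip().strip('"').strip("'")
        # Variables already set in the environment take precedence.
        os.environ.setdefault(key, value)
        loaded[key] = value
    return loaded


load_env_file()
api_key = os.environ.get("BRAINTRUST_API_KEY")  # None if it was never set
```

Reading the key from the environment keeps it out of the notebook itself, which matters if you share or commit the file.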
Task code
We are going to reuse the task function from Tool calls in LLaMa 3.1, which is below. For a detailed explanation of the task, see that recipe.
Dataset
We'll use the same data as well: a subset of the CoQA dataset.
Running evals
Let's create a list of the providers we want to evaluate. Each provider conveniently names its flavor of each model slightly differently, so we can use these as a unique identifier.
To facilitate this test, we also used vLLM to self-host an official Meta-LLaMa-3.1-405B-Instruct-FP8 model, which is available on Hugging Face. You can configure this model as a custom endpoint in Braintrust to use it alongside other providers.
Provider map
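A rough sketch of what such a map could look like is below. The provider names and model identifiers are placeholders, not any provider's real model names; substitute the exact strings each provider documents. The helpers parse_size and metadata_for are likewise hypothetical, and show how the provider and weight class can be recovered from a model name because each name is unique.

```python
import re

# Placeholder identifiers (assumptions): replace with each provider's real names.
PROVIDER_MODELS = {
    "provider-a": [
        "provider-a/llama-3.1-8b-instruct",
        "provider-a/llama-3.1-70b-instruct",
        "provider-a/llama-3.1-405b-instruct",
    ],
    "provider-b": [
        "accounts/provider-b/models/llama-v3p1-8b-instruct",
        "accounts/provider-b/models/llama-v3p1-70b-instruct",
        "accounts/provider-b/models/llama-v3p1-405b-instruct",
    ],
    "self-hosted": ["self-hosted/Meta-Llama-3.1-405B-Instruct-FP8"],
}

# Reverse lookup: each model name maps back to exactly one provider.
MODEL_TO_PROVIDER = {
    model: provider
    for provider, models in PROVIDER_MODELS.items()
    for model in models
}


def parse_size(model: str) -> str:
    """Extract the weight class (for example "70b") from a model name."""
    match = re.search(r"(\d+)b\b", model.lower())
    if match is None:
        raise ValueError(f"no parameter count found in {model!r}")
    return match.group(1) + "b"


def metadata_for(model: str) -> dict:
    """Build the metadata attached to an experiment for this model."""
    return {
        "provider": MODEL_TO_PROVIDER[model],
        "model": model,
        "size": parse_size(model),
    }
```

For example, metadata_for("provider-a/llama-3.1-70b-instruct") yields the provider "provider-a" and the size "70b", which is what makes grouping by provider and weight class possible later.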
Eval code
We'll run each provider in parallel, and within the provider, we'll run each model in parallel. This roughly assumes that rate limits are per model, not per provider.
We're also running with a low concurrency level (3) to avoid overwhelming a provider and hitting rate limits. The Braintrust proxy handles rate limits for us, but they are reflected in the final task duration.
You'll also notice that we parse and track the provider as well as the model in each experiment's metadata. This allows us to do some rich analysis on the results.
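The concurrency structure described above can be sketched with asyncio alone. This is not the Braintrust eval API: run_case is a hypothetical stand-in for the real task, and all function names here are assumptions. The sketch only illustrates the shape of the run: providers in parallel, models in parallel within each provider, at most three test cases in flight per model, and provider and model recorded as metadata.

```python
import asyncio

MAX_CONCURRENCY = 3  # low per-model cap to avoid tripping rate limits


async def run_case(model: str, case: str) -> dict:
    """Hypothetical stand-in for the real task (an LLM call in practice)."""
    await asyncio.sleep(0)  # yield control, as a network call would
    return {"input": case, "output": f"{model}:{case}"}


async def run_model(provider: str, model: str, cases: list) -> dict:
    """Run every test case for one model, at most MAX_CONCURRENCY at a time."""
    semaphore = asyncio.Semaphore(MAX_CONCURRENCY)

    async def guarded(case: str) -> dict:
        async with semaphore:
            return await run_case(model, case)

    results = await asyncio.gather(*(guarded(case) for case in cases))
    return {
        "metadata": {"provider": provider, "model": model},
        "results": list(results),
    }


async def run_provider(provider: str, models: list, cases: list) -> list:
    """Run all of one provider's models in parallel."""
    return await asyncio.gather(
        *(run_model(provider, model, cases) for model in models)
    )


async def run_all(provider_models: dict, cases: list) -> list:
    """Run all providers in parallel and flatten to one list of experiments."""
    per_provider = await asyncio.gather(
        *(run_provider(p, models, cases) for p, models in provider_models.items())
    )
    return [experiment for group in per_provider for experiment in group]
```

Because the semaphore is created per model, the cap applies to each model independently, which matches the assumption that rate limits are per model rather than per provider.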
Results
Let's start by looking at the project view. Braintrust makes it easy to morph this into a multi-level grouped analysis where we can see the score vs. duration in a scatter plot, and how each provider stacks up in the table.
Insights
Now let's dig into this chart and see what we can learn.
- 70b hits a nice sweet spot
It looks like each weight class costs you roughly an extra second on average. However, the jump in average accuracy from 8b to 70b is 16%+, while the jump from 70b to 405b is only 2.87%.
- 8b models are consistently really fast, but some providers' 70b models are slower than others'
The distribution among providers for 8b latency is very tight, but that starts to change with 70b and even more so with 405b models.
- High accuracy variance in 8b models
Within 8b models in particular, there is a pretty significant difference in accuracy between providers.
- Provider 1 is the fastest except for 405b
Interestingly, provider 1's 8b model is both the fastest and most accurate. However, its 405b model, while accurate, is the slowest by far. This is likely due to rate limits, or perhaps they have optimized it using a different method.
- Self-hosting strikes a nice balance
Self-hosting strikes a nice balance between latency and quality (note: we only tested self-hosted 405b). Of course, this comes at a price: around $27/hour using Lambda Labs.
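To make the "sweet spot" observation concrete, here is a small calculation using only the figures quoted above. It assumes each step up in weight class costs about one extra second, and it uses 16 as a lower bound for the 8b to 70b jump since the text gives "16%+". The function name gain_per_second is hypothetical.

```python
def gain_per_second(accuracy_gain_pct: float, extra_latency_s: float) -> float:
    """Accuracy gained per extra second of average latency."""
    return accuracy_gain_pct / extra_latency_s


# Figures quoted in the text; ~1 second per weight class is an approximation.
small_to_mid = gain_per_second(16.0, 1.0)  # 8b -> 70b (lower bound)
mid_to_large = gain_per_second(2.87, 1.0)  # 70b -> 405b

# How much more accuracy the first extra second buys than the second one.
ratio = small_to_mid / mid_to_large
print(f"8b->70b buys about {ratio:.1f}x the accuracy per second of 70b->405b")
```

Under these assumptions, the first extra second of latency buys at least five times as much accuracy as the second.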
Another benchmark
We also ran roughly the same code on a different, more realistic internal benchmark, which measures how well our AI search bar works. Here is the same visualization for that benchmark:
As you can see, certain things are consistent, but others are not. Again, this highlights how important it is to run this analysis on your own use case.
- Provider 1 is less differentiated. Although Provider 1 is still the fastest, it comes at the cost of accuracy in the 70b and 405b classes, where Provider 2 wins on accuracy. Provider 2 also wins on speed for 405b.
- Provider 3 has a hard time in the 70b class. This workload is heavy on prompt tokens (~3500 per test case). Maybe that has something to do with it?
- More latency variance across the board. Again, this may have to do with the significant jump in prompt tokens.
- Self-hosted seems to be about the same. Interestingly, the self-hosted model appears at about the same spot in the graph!
Where to go from here
This is just one benchmark, but as you can see, there is a pretty significant difference in speed and accuracy between providers. I'd highly encourage testing on your own workload and using a tool like Braintrust to help you construct a good eval and understand the trade-offs across providers in depth.
Feel free to reach out if we can help, or sign up to try out Braintrust for yourself. If you enjoy performing this kind of analysis, we are hiring.
Happy evaluating!
Thanks to Hamel for hosting the self-hosted model and for feedback on drafts.