Evaluating audio with the OpenAI Realtime API
The OpenAI Realtime API, designed for building advanced multimodal conversational experiences, unlocks even more use cases in AI applications. However, evaluating the outputs of this and other audio models in practice is still an open problem. In this cookbook, we'll build a robust application with the Realtime API, incorporating tool calling and user input, and then evaluate the results. Let's get started!
Getting started
In this cookbook, we're going to build a speech-to-speech RAG agent that answers questions about the Braintrust documentation.
To get started, you'll need Braintrust, OpenAI, and Pinecone accounts, as well as node, npm, and typescript installed locally. If you'd like to follow along in code, the realtime-rag project contains a working example with all of the documents and code snippets we'll use.
Clone the repo
To start, clone the repo and install the dependencies:
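The exact clone URL comes from the realtime-rag project linked above; with a placeholder for it, and assuming the dependencies are installed from the /web directory where the app lives, the setup looks roughly like this:

```bash
# Replace <repo-url> with the realtime-rag project's repository URL
git clone <repo-url>
cd realtime-rag

# Install dependencies for the web app
cd web && npm install
```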
Next, create a .env.local file with your API keys:
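At a minimum, based on the variables used below, it will contain your OpenAI and Pinecone keys (the repo may expect additional variables):

```bash
# .env.local (used locally to embed and upload the vectors)
OPENAI_API_KEY=sk-...
PINECONE_API_KEY=...
```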
Finally, make sure to set your OPENAI_API_KEY environment variable in the AI providers section of your Braintrust account, and set the PINECONE_API_KEY environment variable in the Environment variables section.
We'll use the local environment variables to embed and upload the vectors, and the Braintrust variables to run the RAG tool and LLM calls remotely.
Upload the vectors
To upload the vectors, run the upload-vectors.ts script:
This script reads all the files from the docs-sample directory, breaks them into sections based on headings, and creates vector embeddings for each section using OpenAI's API. It then stores those embeddings, along with each section's title and content, in Pinecone.
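The full script is in the repo; a condensed sketch of the same flow (the embedding model and index name here are assumptions, not the repo's exact choices) looks like this:

```typescript
// upload-vectors.ts (condensed sketch; the real script lives in the repo)
import fs from "fs";
import path from "path";
import OpenAI from "openai";
import { Pinecone } from "@pinecone-database/pinecone";

const openai = new OpenAI(); // reads OPENAI_API_KEY
const pinecone = new Pinecone({ apiKey: process.env.PINECONE_API_KEY! });
const index = pinecone.index("braintrust-docs"); // index name is illustrative

// Split a markdown file into sections keyed by heading
function splitIntoSections(markdown: string): { title: string; content: string }[] {
  const sections: { title: string; content: string }[] = [];
  let current = { title: "Introduction", content: "" };
  for (const line of markdown.split("\n")) {
    if (line.startsWith("#")) {
      if (current.content.trim()) sections.push(current);
      current = { title: line.replace(/^#+\s*/, ""), content: "" };
    } else {
      current.content += line + "\n";
    }
  }
  if (current.content.trim()) sections.push(current);
  return sections;
}

async function main() {
  const docsDir = path.join(__dirname, "docs-sample");
  for (const file of fs.readdirSync(docsDir)) {
    const sections = splitIntoSections(fs.readFileSync(path.join(docsDir, file), "utf8"));
    for (const [i, section] of sections.entries()) {
      // Embed each section with OpenAI, then upsert it into Pinecone with its metadata
      const embedding = await openai.embeddings.create({
        model: "text-embedding-3-small",
        input: `${section.title}\n${section.content}`,
      });
      await index.upsert([
        {
          id: `${file}-${i}`,
          values: embedding.data[0].embedding,
          metadata: { title: section.title, content: section.content },
        },
      ]);
    }
  }
}

main();
```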
That's it for setup! Now let's dig into the code.
Accessing the Realtime API
Building with the OpenAI Realtime API is complex because it is built on WebSockets, and it lacks client-side authentication. However, the Braintrust AI Proxy makes it easy to connect to the API in a secure and scalable way. The proxy securely manages your OpenAI API key, issuing temporary credentials to your backend and frontend. The frontend sends any voice data from your app to the proxy, which handles secure communication with OpenAI’s Realtime API.
To access the Realtime API through the Braintrust proxy, we changed the proxy URL to https://braintrustproxy.com/v1/realtime when instantiating the RealtimeClient. In our app, the RealtimeClient is initialized when the ConsolePage component is rendered. We set up this logic in page.tsx:
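In simplified form (the helper name and the way the temporary credential is obtained are assumptions here, not the repo's exact code), the client setup looks like this:

```typescript
import { RealtimeClient } from "@openai/realtime-api-beta";

// `apiKey` is a temporary credential issued through the Braintrust proxy;
// how it is fetched is omitted here (see page.tsx in the repo).
export function createRealtimeClient(apiKey: string) {
  return new RealtimeClient({
    url: "https://braintrustproxy.com/v1/realtime", // Braintrust proxy instead of api.openai.com
    apiKey,
    dangerouslyAllowAPIKeyInBrowser: true,
  });
}
```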
You can also use our proxy with an AI provider’s API key, but you will not have access to other Braintrust features, like logging.
Creating a RAG tool
The retrieval logic also happens on the server side. We set up the helper function and route handler that queries Pinecone in route.ts so that we can call the retrieval tool on the client side like this:
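A sketch of what that registration looks like (the tool name and route path below are illustrative; the real definitions live in the repo):

```typescript
import { RealtimeClient } from "@openai/realtime-api-beta";

// Register a retrieval tool on the RealtimeClient that calls our
// Pinecone-backed route handler on the server.
export function registerRetrievalTool(client: RealtimeClient) {
  client.addTool(
    {
      name: "search_docs",
      description: "Search the Braintrust documentation for relevant sections",
      parameters: {
        type: "object",
        properties: {
          query: { type: "string", description: "The search query" },
        },
        required: ["query"],
      },
    },
    async ({ query }: { query: string }) => {
      // The route handler embeds the query and searches Pinecone on the server
      const res = await fetch("/api/search", {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({ query }),
      });
      return await res.json();
    }
  );
}
```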
Currently, because of the way the Realtime API works, we have to use OpenAI tool calling here instead of Braintrust tool functions.
Setting up the system prompt
When we call the Realtime API, we pass it a set of instructions that are configured in conversation_config.js:
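The full instructions are in the repo; the shape of the config is roughly this (the wording below is illustrative):

```javascript
// conversation_config.js (illustrative; the repo's instructions are more detailed)
export const instructions = `System settings:
Tool use: enabled.

Instructions:
- You are an assistant that answers questions about the Braintrust documentation.
- Use the retrieval tool to look up relevant sections of the docs before answering.
- Keep responses short and conversational, since they will be spoken aloud.`;
```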
Feel free to play around with the system prompt at any point, and see how it impacts the LLM's responses in the app.
Running the app
To run the app, navigate to /web and run npm run dev. The app should load at localhost:3000.
Start a new conversation, and ask a few questions about Braintrust. Feel free to interrupt the bot, or ask unrelated questions, and see what happens. When you're finished, end the conversation. Have a couple of conversations to get a feel for some of the limitations and nuances of the bot - each conversation will come in handy in the next step.
Logging in Braintrust
In addition to secure client-side authentication, building with Braintrust gives you its other benefits, like logging, out of the box. When you ran the app and connected to the Realtime API, logs were generated for each conversation. When you closed the session, the log was complete and ready to view in Braintrust. Each LLM and tool call is contained in its own span inside of the trace. In addition, the audio files were uploaded as attachments in your trace. This means that you don’t have to exit the UI to listen to each of the inputs and outputs for the LLM calls.
Online evaluations
In Braintrust, you can run server-side online evaluations that execute automatically and asynchronously as you upload logs. This makes it easier to evaluate your app in situations like this one, where the prompt and tool might not be synced to Braintrust.
Audio evals are complex, because there are multiple aspects of your application you can focus on. In this cookbook, we'll use the vector search query as a proxy for the quality of the Realtime API's interpretation of the user's input.
Setting up your scorer
We'll need to create a scorer that captures the criteria we want to evaluate. Since we're dealing with complex RAG outputs, we'll use a custom LLM-as-a-judge scorer. For an LLM-as-a-judge scorer, you define a prompt that evaluates the output and maps its choices to specific scores.
Navigate to Library > Scorers and create a new scorer. Call your scorer BraintrustRAG and add the following prompt:
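The exact wording is up to you; a judge prompt along these lines (choice labels illustrative) captures the idea:

```
You are evaluating the results of a vector search over a documentation corpus.

Query:
{{input}}

Retrieved sections:
{{output}}

How well do the retrieved sections answer the query?
a) They fully answer the query
b) They partially answer the query
c) They do not answer the query
```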
The prompt uses mustache syntax to map the input to the query that gets sent to Pinecone and to pull in the output it returns. We'll also assign a score to each of the choices we included in the prompt.
Configuring your online eval
Navigate to Configuration and scroll down to Online scoring. Select Add rule to configure your online scoring rule. Select the scorer we just created from the menu, and deselect Apply to root span. We'll filter to the function span since that's where our tool is called.
The score will now automatically run at the specified sampling rate for all logs in the project.
Viewing your evaluations
Now that you've set up your online evaluations, you can view the scores from within your logs. Underneath each function span that was sampled for scoring, you'll see an additional span with the score.
This particular function call was scored a 0. But if we take a closer look at the logs, we can see that the question was actually answered pretty well. You may notice this pattern for other logs as well - so is our function actually not performing well?
Improving your evals
There are three main ways to improve your evals:
- Refine the scoring function to ensure it accurately reflects the success criteria.
- Add new scoring functions to capture different performance aspects (for example, correctness or efficiency).
- Expand your dataset with more diverse or challenging test cases.
In this case, we need to be more precise about what we're testing for in our scoring function. In our application, we're asking for answers within the specific context of Braintrust, but our current scoring function is attempting to judge the responses to our questions objectively.
Let's edit our scoring function to test for that as precisely as possible.
Improving our existing scorer
Let's change the prompt for our scoring function to:
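For example, something along these lines, which makes the Braintrust-specific context explicit (again, the exact wording is illustrative):

```
You are evaluating the results of a vector search over the Braintrust documentation.
The query was generated from a spoken question about how to use Braintrust.

Query:
{{input}}

Retrieved sections:
{{output}}

Judge the retrieved sections only in the context of the Braintrust product and its
documentation, not as general knowledge. How relevant are they to the query?
a) Highly relevant in the context of Braintrust
b) Somewhat relevant
c) Not relevant
```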
As you continue to iterate on your scoring function and generate more logs, you should aim to see your scores go up.
What's next
As you continue to build more AI applications with complex function calls and new APIs, it's important to continuously improve both your AI application and your evaluation process. Here are some resources to help you do just that: