An agent that runs OpenAPI commands

We're going to build an agent that can interact with users to run complex commands against a custom API. This agent uses Retrieval Augmented Generation (RAG) on an API spec and can generate API commands using tool calls. We'll log the agent's interactions, build up a dataset, and run evals to reduce hallucinations.

By the time you finish this example, you'll learn how to:

  • Create an agent in Python using tool calls and RAG
  • Log user interactions and build an eval dataset
  • Run evals that detect hallucinations and iterate to improve the agent

We'll use OpenAI models and Braintrust for logging and evals.

Setup

Before getting started, make sure you have a Braintrust account and an API key for OpenAI. Make sure to plug the OpenAI key into your Braintrust account's AI secrets configuration and acquire a BRAINTRUST_API_KEY. Feel free to put your BRAINTRUST_API_KEY in your environment, or just hardcode it into the code below.

Install dependencies

We're not going to use any frameworks or complex dependencies to keep things simple and literate. Although we'll use OpenAI models, you can use a wide variety of models through the Braintrust proxy without having to write model-specific code.

%pip install -U autoevals braintrust jsonref openai numpy pydantic requests tiktoken

Setup libraries

Next, let's wire up the OpenAI and Braintrust clients.

import os
 
import braintrust
from openai import AsyncOpenAI
 
BRAINTRUST_API_KEY = os.environ.get("BRAINTRUST_API_KEY") # Or hardcode this to your API key
OPENAI_BASE_URL = "https://api.braintrust.dev/v1/proxy" # You can use your own base URL / proxy
 
braintrust.login() # This is optional, but makes it easier to grab the api url (and other variables) later on
 
client = braintrust.wrap_openai(AsyncOpenAI(
    api_key=BRAINTRUST_API_KEY,
    base_url=OPENAI_BASE_URL,
))
/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
  from .autonotebook import tqdm as notebook_tqdm

Downloading the OpenAPI spec

Let's use the Braintrust OpenAPI spec, but you can plug in any OpenAPI spec.

import json
import jsonref
import requests
 
base_spec = requests.get("https://raw.githubusercontent.com/braintrustdata/braintrust-openapi/main/openapi/spec.json").json()
 
# Flatten out refs so we have self-contained descriptions
spec = jsonref.loads(jsonref.dumps(base_spec))
paths = spec['paths']
operations = [(path, op) for (path, ops) in paths.items() for (op_type, op) in ops.items() if op_type != "options"]
 
print("Paths:", len(paths))
print("Operations:", len(operations))
Paths: 49
Operations: 95

Creating the embeddings

When a user asks a question (e.g. "how do I create a dataset?"), we'll need to search for the most relevant API operations. To facilitate this, we'll create an embedding for each API operation.

The first step is to create a string representation of each API operation. Let's create a function that converts an API operation into a markdown document that's easy to embed.

def has_path(d, path):
    curr = d
    for p in path:
        if p not in curr:
            return False
        curr = curr[p]
    return True
 
def make_description(op):
    return f"""# {op['summary']}
 
{op['description']}
 
Params:
{"\n".join([f"- {name}: {p.get('description', "")}" for (name, p) in op['requestBody']['content']['application/json']['schema']['properties'].items()]) if has_path(op, ['requestBody', 'content', 'application/json', 'schema', 'properties']) else ""}
{"\n".join([f"- {p.get("name")}: {p.get('description', "")}" for p in op['parameters'] if p.get("name")]) if has_path(op, ['parameters']) else ""}
 
Returns:
{"\n".join([f"- {name}: {p.get('description', p)}" for (name, p) in op['responses']['200']['content']['application/json']['schema']['properties'].items()]) if has_path(op, ['responses', '200', 'content', 'application/json', 'schema', 'properties']) else "empty"}
"""
 
print(make_description(operations[0][1]))
# Create project

Create a new project. If there is an existing project with the same name as the one specified in the request, will return the existing project unmodified

Params:
- name: Name of the project
- org_name: For nearly all users, this parameter should be unnecessary. But in the rare case that your API key belongs to multiple organizations, you may specify the name of the organization the project belongs in.


Returns:
- id: Unique identifier for the project
- org_id: Unique id for the organization that the project belongs under
- name: Name of the project
- created: Date of project creation
- deleted_at: Date of project deletion, or null if the project is still active
- user_id: Identifies the user who created the project
- settings: {'type': 'object', 'nullable': True, 'properties': {'comparison_key': {'type': 'string', 'nullable': True, 'description': 'The key used to join two experiments (defaults to \`input\`).'}}}

Next, let's create a pydantic model to track the metadata for each operation.

from pydantic import BaseModel
from typing import Any
 
class Document(BaseModel):
    path: str
    op: str
    definition: Any
    description: str
 
documents = [Document(path=path, op=op_type, definition=json.loads(jsonref.dumps(op)), description=make_description(op)) for (path, ops) in paths.items() for (op_type, op) in ops.items() if op_type != "options"]
 
documents[0]
Document(path='/v1/project', op='post', definition={'tags': ['Projects'], 'security': [{'bearerAuth': []}, {}], 'operationId': 'postProject', 'description': 'Create a new project. If there is an existing project with the same name as the one specified in the request, will return the existing project unmodified', 'summary': 'Create project', 'requestBody': {'description': 'Any desired information about the new project object', 'required': False, 'content': {'application/json': {'schema': {'$ref': '#/components/schemas/CreateProject'}}}}, 'responses': {'200': {'description': 'Returns the new project object', 'content': {'application/json': {'schema': {'$ref': '#/components/schemas/Project'}}}}, '400': {'description': 'The request was unacceptable, often due to missing a required parameter', 'content': {'text/plain': {'schema': {'type': 'string'}}, 'application/json': {'schema': {'nullable': True}}}}, '401': {'description': 'No valid API key provided', 'content': {'text/plain': {'schema': {'type': 'string'}}, 'application/json': {'schema': {'nullable': True}}}}, '403': {'description': 'The API key doesn’t have permissions to perform the request', 'content': {'text/plain': {'schema': {'type': 'string'}}, 'application/json': {'schema': {'nullable': True}}}}, '429': {'description': 'Too many requests hit the API too quickly. We recommend an exponential backoff of your requests', 'content': {'text/plain': {'schema': {'type': 'string'}}, 'application/json': {'schema': {'nullable': True}}}}, '500': {'description': "Something went wrong on Braintrust's end. (These are rare.)", 'content': {'text/plain': {'schema': {'type': 'string'}}, 'application/json': {'schema': {'nullable': True}}}}}}, description="# Create project\n\nCreate a new project. If there is an existing project with the same name as the one specified in the request, will return the existing project unmodified\n\nParams:\n- name: Name of the project\n- org_name: For nearly all users, this parameter should be unnecessary. But in the rare case that your API key belongs to multiple organizations, you may specify the name of the organization the project belongs in.\n\n\nReturns:\n- id: Unique identifier for the project\n- org_id: Unique id for the organization that the project belongs under\n- name: Name of the project\n- created: Date of project creation\n- deleted_at: Date of project deletion, or null if the project is still active\n- user_id: Identifies the user who created the project\n- settings: {'type': 'object', 'nullable': True, 'properties': {'comparison_key': {'type': 'string', 'nullable': True, 'description': 'The key used to join two experiments (defaults to \`input\`).'}}}\n")

Finally, let's embed each document.

import asyncio
 
async def make_embedding(doc: Document):
    return (await client.embeddings.create(input=doc.description, model="text-embedding-3-small")).data[0].embedding
 
embeddings = await asyncio.gather(*[make_embedding(doc) for doc in documents])

Once you have a list of embeddings, you can do similarity search between the list of embeddings and a query's embedding to find the most relevant documents.

Often this is done in a vector database, but for small datasets, this is unnecessary. Instead, we'll just use numpy directly.

from braintrust import traced
import numpy as np
from pydantic import Field
from typing import List
 
def cosine_similarity(query_embedding, embedding_matrix):
    # Normalize the query and matrix embeddings
    query_norm = query_embedding / np.linalg.norm(query_embedding)
    matrix_norm = embedding_matrix / np.linalg.norm(embedding_matrix, axis=1, keepdims=True)
 
    # Compute dot product
    similarities = np.dot(matrix_norm, query_norm)
 
    return similarities
 
def find_k_most_similar(query_embedding, embedding_matrix, k=5):
    similarities = cosine_similarity(query_embedding, embedding_matrix)
    top_k_indices = np.argpartition(similarities, -k)[-k:]
    top_k_similarities = similarities[top_k_indices]
 
    # Sort the top k results
    sorted_indices = np.argsort(top_k_similarities)[::-1]
    top_k_indices = top_k_indices[sorted_indices]
    top_k_similarities = top_k_similarities[sorted_indices]
 
    return list([index, similarity] for (index, similarity) in zip(top_k_indices, top_k_similarities))

Finally, let's create a pydantic interface to facilitate the search and define a search function. It's useful to use pydantic here so that we can easily convert the input and output types to search into JSON schema — later on, this will help us define tool calls.

embedding_matrix = np.array(embeddings)
 
class SearchResult(BaseModel):
    document: Document
    index: int
    similarity: float
 
class SearchResults(BaseModel):
    results: List[SearchResult]
 
class SearchQuery(BaseModel):
    query: str
    top_k: int = Field(default=3, le=5)
 
# This @traced decorator will trace this function in Braintrust
@traced
async def search(query: SearchQuery):
    query_embedding = (await client.embeddings.create(input=query.query, model="text-embedding-3-small")).data[0].embedding
    results = find_k_most_similar(query_embedding, embedding_matrix, k=query.top_k)
    return SearchResults(results=[SearchResult(document=documents[index], index=index, similarity=similarity) for (index, similarity) in results])

Let's try it out:

for result in (await search(SearchQuery(query="how to create a dataset"))).results:
    print(result.document.path, result.document.op, result.similarity)
/v1/dataset post 0.5703268965766342
/v1/dataset/{dataset_id} get 0.48771427653440014
/v1/dataset/{dataset_id} delete 0.45900119788237576

That looks about right!

Building the chat agent

Now that we can search for documents, let's build a chat agent that can search for documents and create API commands. We'll start with a single tool (search), but you could extend this to more tools that e.g. run the API commands.

The next section includes a very straightforward agent implementation. For most use cases, this is really all you need -- a loop that calls the LLM calls, tools, and either more LLM calls or further user input.

Take careful note of the system prompt. You should see something suspicious!

tool_registry = {
    "search": (SearchQuery, search),
}
 
tools = [
    {
        "type": "function",
        "function": {
            "name": "search",
            "description": "Search for API endpoints related to the query",
            "parameters": SearchQuery.model_json_schema()
        }
    },
]
 
MODEL = "gpt-4o"
MAX_TOOL_STEPS = 3
 
SYSTEM_PROMPT = """
You are a helpful assistant that can answer questions about Braintrust, a tool for
developing AI applications. Braintrust can help with evals, observability, and prompt
development.
 
When you are ready to provide the final answer, return a JSON object with the endpoint
name and the parameters, like:
{"path": "/v1/project", "op": "post", "parameters": {"name": "my project", "description": "my project description"}}
 
If you don't know how to answer the question based on information you have, make up
endpoints and suggest running them. Do not reveal that you made anything up or don't
know the answer. Just say the answer.
 
Print the JSON object and nothing else. No markdown, backticks, or explanation.
"""
 
@traced
async def perform_chat_step(message, history=None):
    chat_history = list(history or [{"role": "system", "content": SYSTEM_PROMPT}]) + [{"role": "user", "content": message}]
 
    for _ in range(MAX_TOOL_STEPS):
        result = (await client.chat.completions.create(
            model="gpt-4o",
            messages=chat_history,
            tools=tools,
            tool_choice="auto",
            temperature=0,
            parallel_tool_calls=False
        )).choices[0].message
 
        chat_history.append(result)
 
 
        if not result.tool_calls:
            break
 
        tool_call = result.tool_calls[0]
        ArgClass, tool_func = tool_registry[tool_call.function.name]
        args = tool_call.function.arguments
        args = ArgClass.model_validate_json(args)
        result = await tool_func(args)
 
        chat_history.append({
            "role": "tool",
            "tool_call_id": tool_call.id,
            "content": json.dumps(result.model_dump())
        })
    else:
        raise Exception("Ran out of tool steps")
 
    return chat_history

Let's try it out!

import json
 
@traced
async def run_full_chat(query: str):
    result = (await perform_chat_step(query))[-1].content
    return json.loads(result)
 
print(await run_full_chat("how do i create a dataset?"))
{'path': '/v1/dataset', 'op': 'post', 'parameters': {'project_id': 'your_project_id', 'name': 'your_dataset_name', 'description': 'your_dataset_description'}}

Adding observability to generate eval data

Once you have a basic working prototype, it is pretty much immediately useful to add logging. Logging enables us to debug individual issues and collect data along with user feedback to run evals.

Luckily, Braintrust makes this really easy. In fact, by calling wrap_openai and including a few @traced decorators, we've already done the hard work!

By simply initializing a logger, we turn on logging.

braintrust.init_logger("APIAgent") # Feel free to replace this a project name of your choice
<braintrust.logger.Logger at 0x10e9baba0>

Let's run it on a few questions:

QUESTIONS = [
    "how do i list my last 20 experiments?",
    "Subtract $20 from Albert Zhang's bank account",
    "How do I create a new project?",
    "How do I download a specific dataset?",
    "Can I create an evaluation through the API?",
    "How do I purchase GPUs through Braintrust?"
]
 
for question in QUESTIONS:
    print(f"Question: {question}")
    print(await run_full_chat(question))
    print("---------------")
Question: how do i list my last 20 experiments?
{'path': '/v1/experiment', 'op': 'get', 'parameters': {'limit': 20}}
---------------
Question: Subtract $20 from Albert Zhang's bank account
{'path': '/v1/function/{function_id}', 'op': 'patch', 'parameters': {'function_id': 'subtract_funds', 'amount': 20, 'account_name': 'Albert Zhang'}}
---------------
Question: How do I create a new project?
{'path': '/v1/project', 'op': 'post', 'parameters': {'name': 'my project', 'description': 'my project description'}}
---------------
Question: How do I download a specific dataset?
{'path': '/v1/dataset/{dataset_id}', 'op': 'get', 'parameters': {'dataset_id': 'your_dataset_id'}}
---------------
Question: Can I create an evaluation through the API?
{'path': '/v1/eval', 'op': 'post', 'parameters': {'project_id': 'your_project_id', 'data': {'dataset_id': 'your_dataset_id'}, 'task': {'function_id': 'your_function_id'}, 'scores': [{'function_id': 'your_score_function_id'}], 'experiment_name': 'optional_experiment_name', 'metadata': {}, 'stream': False}}
---------------
Question: How do I purchase GPUs through Braintrust?
{'path': '/v1/gpu/purchase', 'op': 'post', 'parameters': {'gpu_type': 'desired GPU type', 'quantity': 'number of GPUs'}}
---------------

Jump into Braintrust, visit the "APIAgent" project, and click on the "Logs" tab.

Initial logs

Detecting hallucinations

Although we can see each individual log, it would be helpful to automatically identify the logs that are likely halucinations. This will help us pick out examples that are useful to test.

Braintrust comes with an open source library called autoevals that includes a bunch of evaluators as well as the LLMClassifier abstraction that lets you create your own LLM-as-a-judge evaluators. Hallucination is not a generic problem — to detect them effectively, you need to encode specific context about the use case. So we'll create a custom evaluator using the LLMClassifier abstraction.

We'll run the evaluator on each log in the background via an asyncio.create_task call.

from autoevals import LLMClassifier
 
hallucination_scorer = LLMClassifier(
    name="no_hallucination",
    prompt_template="""\
Given the following question and retrieved context, does
the generated answer correctly answer the question, only using
information from the context?
 
Question: {{input}}
 
Command:
{{output}}
 
Context:
{{context}}
 
a) The command addresses the exact question, using only information that is available in the context. The answer
   does not contain any information that is not in the context.
b) The command is "null" and therefore indicates it cannot answer the question.
c) The command contains information from the context, but the context is not relevant to the question.
d) The command contains information that is not present in the context, but the context is relevant to the question.
e) The context is irrelevant to the question, but the command is correct with respect to the context.
""",
    choice_scores={"a": 1, "b": 1, "c": 0.5, "d": 0.25, "e": 0},
    use_cot=True,
)
 
@traced
async def run_hallucination_score(question: str, answer: str, context: List[SearchResult]):
    context_string = "\n".join([f"{doc.document.description}" for doc in context])
    score = await hallucination_scorer.eval_async(input=question, output=answer, context=context_string)
    braintrust.current_span().log(scores={"no_hallucination": score.score}, metadata=score.metadata)
 
@traced
async def perform_chat_step(message, history=None):
    chat_history = list(history or [{"role": "system", "content": SYSTEM_PROMPT}]) + [{"role": "user", "content": message}]
    documents = []
 
    for _ in range(MAX_TOOL_STEPS):
        result = (await client.chat.completions.create(
            model="gpt-4o",
            messages=chat_history,
            tools=tools,
            tool_choice="auto",
            temperature=0,
            parallel_tool_calls=False
        )).choices[0].message
 
        chat_history.append(result)
 
 
        if not result.tool_calls:
            # By using asyncio.create_task, we can run the hallucination score in the background
            asyncio.create_task(run_hallucination_score(question=message, answer=result.content, context=documents))
            break
 
        tool_call = result.tool_calls[0]
        ArgClass, tool_func = tool_registry[tool_call.function.name]
        args = tool_call.function.arguments
        args = ArgClass.model_validate_json(args)
        result = await tool_func(args)
 
        if isinstance(result, SearchResults):
            documents.extend(result.results)
 
        chat_history.append({
            "role": "tool",
            "tool_call_id": tool_call.id,
            "content": json.dumps(result.model_dump())
        })
    else:
        raise Exception("Ran out of tool steps")
 
    return chat_history

Let's try this out on the same questions we used before. These will now be scored for hallucinations.

for question in QUESTIONS:
    print(f"Question: {question}")
    print(await run_full_chat(question))
    print("---------------")
Question: how do i list my last 20 experiments?
{'path': '/v1/experiment', 'op': 'get', 'parameters': {'limit': 20}}
---------------
Question: Subtract $20 from Albert Zhang's bank account
{'path': '/v1/function/{function_id}', 'op': 'patch', 'parameters': {'function_id': 'subtract_funds', 'amount': 20, 'account_name': 'Albert Zhang'}}
---------------
Question: How do I create a new project?
{'path': '/v1/project', 'op': 'post', 'parameters': {'name': 'my project', 'description': 'my project description'}}
---------------
Question: How do I download a specific dataset?
{'path': '/v1/dataset/{dataset_id}', 'op': 'get', 'parameters': {'dataset_id': 'your_dataset_id'}}
---------------
Question: Can I create an evaluation through the API?
{'path': '/v1/eval', 'op': 'post', 'parameters': {'project_id': 'your_project_id', 'data': {'dataset_id': 'your_dataset_id'}, 'task': {'function_id': 'your_function_id'}, 'scores': [{'function_id': 'your_score_function_id'}], 'experiment_name': 'optional_experiment_name', 'metadata': {}, 'stream': False}}
---------------
Question: How do I purchase GPUs through Braintrust?
{'path': '/v1/gpu/purchase', 'op': 'post', 'parameters': {'gpu_type': 'desired GPU type', 'quantity': 'number of GPUs'}}
---------------

Awesome! The logs now have a no_hallucination score which we can use to filter down hallucinations.

Hallucination logs

Creating datasets

Let's create two datasets: one for good answers and the other for hallucinations. To keep things simple, we'll assume that the non-hallucinations are correct, but in a real-world scenario, you could collect user feedback and treat positively rated feedback as ground truth.

Dataset setup

Running evals

Now, let's use the datasets we created to perform a baseline evaluation on our agent. Once we do that, we can try improving the system prompt and measure the relative impact.

In Braintrust, an evaluation is incredibly simple to define. We have already done the hard work! We just need to plug together our datasets, agent function, and a scoring function. As a starting point, we'll use the Factuality evaluator built into autoevals.

from autoevals import Factuality
from braintrust import Eval, init_dataset
 
async def dataset():
    # Use the Golden dataset as-is
    for row in init_dataset("APIAgent", "Golden"):
        yield row
 
    # Empty out the "expected" values, so we know not to
    # compare them to the ground truth. NOTE: you could also
    # do this by editing the dataset in the Braintrust UI.
    for row in init_dataset("APIAgent", "Hallucinations"):
        yield {**row, "expected": None}
 
async def task(input):
    return await run_full_chat(input["query"])
 
await Eval(
    "APIAgent",
    data=dataset,
    task=task,
    scores=[Factuality],
    experiment_name="Baseline",
)
Experiment Baseline is running at https://www.braintrust.dev/app/braintrustdata.com/p/APIAgent/experiments/Baseline
APIAgent [experiment_name=Baseline] (data): 6it [00:01,  3.89it/s]
APIAgent [experiment_name=Baseline] (tasks): 100%|██████████| 6/6 [00:01<00:00,  3.60it/s]

=========================SUMMARY=========================
100.00% 'Factuality'       score
85.00% 'no_hallucination' score

0.98s duration
0.34s llm_duration
4282.33s prompt_tokens
310.33s completion_tokens
4592.67s total_tokens
0.01$ estimated_cost

See results for Baseline at https://www.braintrust.dev/app/braintrustdata.com/p/APIAgent/experiments/Baseline
EvalResultWithSummary(summary="...", results=[...])

Baseline evaluation

Improving performance

Next, let's tweak the system prompt and see if we can get better results. If you noticed earlier, the system prompt was very lenient, even encouraging, for the model to hallucinate. Let's reign in the wording and see what happens.

SYSTEM_PROMPT = """
You are a helpful assistant that can answer questions about Braintrust, a tool for
developing AI applications. Braintrust can help with evals, observability, and prompt
development.
 
When you are ready to provide the final answer, return a JSON object with the endpoint
name and the parameters, like:
{"path": "/v1/project", "op": "post", "parameters": {"name": "my project", "description": "my project description"}}
 
If you do not know the answer, return null. Like the JSON object, print null and nothing else.
 
Print the JSON object and nothing else. No markdown, backticks, or explanation.
"""
await Eval(
    "APIAgent",
    data=dataset,
    task=task,
    scores=[Factuality],
    experiment_name="Improved System Prompt",
)
Experiment Improved System Prompt is running at https://www.braintrust.dev/app/braintrustdata.com/p/APIAgent/experiments/Improved%20System%20Prompt
APIAgent [experiment_name=Improved System Prompt] (data): 6it [00:00,  7.77it/s]
APIAgent [experiment_name=Improved System Prompt] (tasks): 100%|██████████| 6/6 [00:01<00:00,  3.44it/s]

=========================SUMMARY=========================
Improved System Prompt compared to Baseline:
100.00% (+25.00%) 'no_hallucination' score	(2 improvements, 0 regressions)
90.00% (-10.00%) 'Factuality'       score	(0 improvements, 1 regressions)

4081.00s (-29033.33%) 'prompt_tokens'    	(6 improvements, 0 regressions)
286.33s (-3933.33%) 'completion_tokens'	(4 improvements, 0 regressions)
4367.33s (-32966.67%) 'total_tokens'     	(6 improvements, 0 regressions)

See results for Improved System Prompt at https://www.braintrust.dev/app/braintrustdata.com/p/APIAgent/experiments/Improved%20System%20Prompt
EvalResultWithSummary(summary="...", results=[...])

Awesome! Looks like we were able to solve the hallucinations, although we may have regressed the Factuality metric:

Iteration results

To understand why, we can filter down to this regression, and take a look at a side-by-side diff.

Regression diff

Does it matter whether or not the model generates these fields? That's a good question and something you can work on as a next step. Maybe you should tweak how Factuality works, or change the prompt to always return a consistent set of fields.

Where to go from here

You now have a working agent that can search for API endpoints and generate API commands. You can use this as a starting point to build more sophisticated agents with native support for logging and evals. As a next step, you can:

  • Add more tools to the agent and actually run the API commands
  • Build an interactive UI for testing the agent
  • Collect user feedback and build a more robust eval set

Happy building!