Prompt versioning and deployment

Large language models can feel unpredictable: small changes to a prompt can dramatically alter the quality and tone of generated responses. In customer support this is especially important, since customer satisfaction, brand tone, and the clarity of the solutions you offer all depend on consistent, high-quality prompts. Optimizing this process means creating a few prompt variations, measuring their effectiveness, and sometimes returning to previous versions that performed better.

In this cookbook, we'll build a support chatbot and walk through the complete cycle of prompt development. Starting with a basic implementation, we'll create increasingly sophisticated prompts, keep track of different versions, evaluate their performance, and switch back to earlier versions when necessary.

Getting started

Before getting started, make sure you have a Braintrust account and an OpenAI API key. Then, add the OpenAI key to your Braintrust account's AI providers configuration.

Once you have your Braintrust account set up with an OpenAI API key, install the following dependencies:

pip install braintrust autoevals openai

To authenticate with Braintrust, export your BRAINTRUST_API_KEY as an environment variable:

export BRAINTRUST_API_KEY="YOUR_API_KEY_HERE"

Exporting your API key is a best practice, but to make it easier to follow along with this cookbook, you can also hardcode it into the code below.

Once the API key is set, we can import our modules and initialize the OpenAI client using the AI proxy:

import os
from openai import OpenAI
from braintrust import Eval, wrap_openai, invoke
from autoevals import LLMClassifier
 
# Uncomment the following line to hardcode your API key
# os.environ["BRAINTRUST_API_KEY"] = "YOUR_API_KEY_HERE"
 
# Initialize OpenAI client with Braintrust wrapper
client = wrap_openai(
    OpenAI(
        base_url="https://api.braintrust.dev/v1/proxy",
        api_key=os.environ["BRAINTRUST_API_KEY"],
    )
)
 
project_name = "SupportChatbot"
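
Note that the rest of this cookbook calls prompts through invoke rather than through this client directly, but wrapping the client means any chat completion you make goes through the Braintrust AI proxy and is traced when it runs inside an experiment or logger. A minimal sketch of a direct call:

# Example of a direct, proxied chat completion (not required for the evals below)
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Where is my package?"}],
)
print(response.choices[0].message.content)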

Creating a dataset

We'll create a small dataset of sample customer complaints and inquiries to evaluate our prompts. In a production application, you'd want to use real customer interactions from your logs to create a representative dataset.

dataset = [
    {
        "input": "Why did my package disappear after tracking showed it was delivered?",
        "metadata": {"category": "shipping"},
    },
    {
        "input": "Your product smells like burnt rubber - what’s wrong with it?",
        "metadata": {"category": "product"},
    },
    {
        "input": "I ordered 3 items but only got 1, where’s the rest?",
        "metadata": {"category": "shipping"},
    },
    {
        "input": "Why does your app crash every time I try to check out?",
        "metadata": {"category": "tech"},
    },
    {
        "input": "My refund was supposed to be here 2 weeks ago - what’s the holdup?",
        "metadata": {"category": "returns"},
    },
    {
        "input": "Your instructions say ‘easy setup’ but it took me 3 hours!",
        "metadata": {"category": "product"},
    },
    {
        "input": "Why does your delivery guy keep leaving packages at the wrong house?",
        "metadata": {"category": "shipping"},
    },
    {
        "input": "The discount code you sent me doesn’t work - fix it!",
        "metadata": {"category": "sales"},
    },
    {
        "input": "Your support line hung up on me twice - what’s going on?",
        "metadata": {"category": "support"},
    },
    {
        "input": "Why is your website saying my account doesn’t exist when I just made it?",
        "metadata": {"category": "tech"},
    },
]
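
This inline list is convenient for a walkthrough, but you can also store the records as a managed dataset in Braintrust. Here's a sketch using init_dataset; the dataset name support-inquiries is our own choice:

import braintrust

# Upload the sample records to a managed Braintrust dataset
bt_dataset = braintrust.init_dataset(project="SupportChatbot", name="support-inquiries")
for record in dataset:
    bt_dataset.insert(input=record["input"], metadata=record["metadata"])

An Eval can then read from the managed dataset by passing it as the data argument instead of the inline list.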

Creating a scoring function

When evaluating support responses, we care about tone, helpfulness, and professionalism, not just accuracy. To do this, we use an LLMClassifier that checks for alignment with brand guidelines:

brand_alignment_scorer = LLMClassifier(
    name="BrandAlignment",
    prompt_template="""
    Evaluate if the response aligns with our brand guidelines (Y/N):
    1. **Positive Tone**: Uses upbeat language, avoids negativity (e.g., "We’re thrilled to help!" vs. "That’s your problem").
    2. **Proactive Approach**: Offers a clear next step or solution (e.g., "We’ll track it now!" vs. vague promises).
    3. **Apologetic When Appropriate**: Acknowledges issues with empathy (e.g., "So sorry for the mix-up!" vs. ignoring the complaint).
    4. **Solution-Oriented**: Focuses on fixing the issue for the customer (e.g., "Here’s how we’ll make it right!" vs. excuses).
    5. **Professionalism**: No profanity or emojis.
 
    Response: {{output}}
 
 
    Only give a Y if all of the criteria are met. If any criterion is not satisfied, give an N.
    """,
    choice_scores={
        "Y": 1,
        "N": 0,
    },  # This scorer will return a 1 if the response fully matches all brand guidelines, and a 0 otherwise.
    use_cot=True,
)
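
Because autoevals scorers are callable, you can sanity-check the classifier on a handwritten response before wiring it into an eval. A quick sketch, where the sample output is our own:

# Sanity-check the scorer on a sample response
sample_output = (
    "So sorry for the mix-up! We're thrilled to help: "
    "we'll track your package right now and follow up within the hour."
)
result = brand_alignment_scorer(sample_output)
print(result.score)  # 1 if all brand guidelines are met, 0 otherwise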

Creating a prompt

To push a prompt to Braintrust, we need to create a new Python file prompt_v1.py that defines the prompt. Once we've created the file, we can push it to Braintrust via the CLI. Let's start with a basic prompt that provides a direct response to customer inquiries:

# Create a prompt_v1.py file
 
import braintrust
 
project = braintrust.projects.create(name="SupportChatbot")
 
prompt_v1 = project.prompts.create(
    name="Brand Support V1",
    slug="brand-support-v1",
    description="Simple support prompt",
    model="gpt-4o",
    messages=[{"role": "user", "content": "{{{input}}}"}],
    if_exists="replace",
)

To push the prompt to Braintrust, run:

braintrust push prompt_v1.py

After pushing the prompt, you'll see it in the Braintrust UI.

Evaluating prompt v1

Now that our first prompt is ready, we'll define a task function that calls this prompt. Then, we'll run an evaluation with our brand_alignment_scorer:

# Define task using invoke with correct input
def task_v1(input):
    result = invoke(
        project_name=project_name,
        slug="brand-support-v1",
        input={"input": input},  # Matches {{{input}}} in our prompt
    )
    return result
 
 
eval_task = Eval(
    project_name,
    data=lambda: dataset,
    task=task_v1,
    scores=[brand_alignment_scorer],
    experiment_name="prompt_v1",
)
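
To run the evaluation, save the task and Eval in a file (we'll call it eval_v1.py; the filename is our choice) and run it with the Braintrust CLI:

braintrust eval eval_v1.py

You can also execute the file directly with Python, since Eval runs immediately when called outside the CLI runner.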

After running the evaluation, you'll see the results in the Braintrust UI.

Improving our prompt

Our initial evaluation showed that there is room for improvement. Let's create a more sophisticated prompt that incorporates our brand guidelines to encourage a positive, proactive tone and clear solutions. Like before, we'll create a new Python file called prompt_v2.py and push it to Braintrust.

# Create a prompt_v2.py file
 
import braintrust
 
project = braintrust.projects.create(name="SupportChatbot")
 
prompt_v2 = project.prompts.create(
    name="Brand Support V2",
    slug="brand-support-v2",
    description="Brand-aligned support prompt",
    model="gpt-4o",
    messages=[
        {
            "role": "system",
            "content": "You’re a cheerful, proactive assistant for Sunshine Co. Always use a positive tone, apologize for issues with empathy, and offer clear solutions to delight customers! No emojis or profanity.",
        },
        {"role": "user", "content": "{{{input}}}"},
    ],
    if_exists="replace",
)

To push the prompt to Braintrust, run:

braintrust push prompt_v2.py

Evaluating prompt v2

We now point our task function to the slug of our second prompt:

def task_v2(input):
    result = invoke(
        project_name=project_name,
        slug="brand-support-v2",
        input={"input": input},
    )
    return result
 
 
eval_task = Eval(
    project_name,
    data=lambda: dataset,
    task=task_v2,
    scores=[brand_alignment_scorer],
    experiment_name="prompt_v2",
)

There is a clear improvement in brand alignment.

Experimenting with tone

For our third prompt, let's create prompt_v3.py and exaggerate the brand voice further. This example is intentionally over the top to show how brand alignment can fail when the tone is too extreme or the solutions offered are too vague. In practice, you'd likely test more subtle variations.

# Create a prompt_v3.py file
 
import braintrust
 
project = braintrust.projects.create(name="SupportChatbot")
 
prompt_v3 = project.prompts.create(
    name="Brand Support V3",
    slug="brand-support-v3",
    description="Over-enthusiastic support prompt with middling performance",
    model="gpt-4o",
    messages=[
        {
            "role": "system",
            "content": "You’re a SUPER EXCITED Sunshine Co. assistant! SHOUT IN ALL CAPS WITH LOTS OF EXCLAMATIONS!!!! SAY SORRY IF SOMETHING’S WRONG BUT KEEP IT VAGUE AND FUN!!! Make customers HAPPY with BIG ENERGY, even if solutions are UNCLEAR!!!!",
        },
        {"role": "user", "content": "{{{input}}}"},
    ],
    if_exists="replace",
)

To push the prompt to Braintrust, run:

braintrust push prompt_v3.py

Evaluating prompt v3

def task_v3(input):
    result = invoke(
        project_name=project_name,
        slug="brand-support-v3",
        input={"input": input},
    )
    return result
 
 
eval_task = Eval(
    project_name,
    data=lambda: dataset,
    task=task_v3,
    scores=[brand_alignment_scorer],
    experiment_name="prompt_v3",
)

You might notice a lower brand alignment score here. This highlights why controlled tone adjustments are crucial in real-world scenarios, and why you may need several iterations to find an optimal prompt.

Managing prompt versions

After evaluating all three versions, we found that our second prompt achieved the highest score.

Although we've since iterated past it, Braintrust makes it simple to revert to this high-performing version by pointing the task function back at its slug:

def task_reverted(input):
    result = invoke(
        project_name=project_name,
        slug="brand-support-v2",
        input={"input": input},
    )
    return result
 
 
eval_task = Eval(
    project_name,
    data=lambda: dataset,
    task=task_reverted,
    scores=[brand_alignment_scorer],
    experiment_name="prompt_v2_reverted",
)
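
Since the three task functions differ only in the slug they pass to invoke, you can make switching (or reverting) versions a one-line change with a small factory. This helper is our own convenience, not a Braintrust API:

def make_task(slug):
    """Return a task function bound to a specific prompt slug."""
    def task(input):
        return invoke(project_name=project_name, slug=slug, input={"input": input})
    return task


# Reverting is now just a matter of picking the slug
task_reverted = make_task("brand-support-v2")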

If you keep the same slug across multiple changes, Braintrust's built-in versioning lets you inspect and revert versions directly in the UI. See the docs on prompt versioning for more information.
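
When versions share a slug, you can also pin a specific version in code instead of reverting in the UI. A sketch using load_prompt, where the version string is a placeholder you'd copy from the UI:

import braintrust

# Pin an earlier version of the prompt by its version identifier
prompt = braintrust.load_prompt(
    project="SupportChatbot",
    slug="brand-support-v2",
    version="<version-id-from-ui>",
)

# prompt.build() fills in the template variables and returns OpenAI-style kwargs
completion = client.chat.completions.create(**prompt.build(input="Where is my order?"))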

Next steps
