Generating beautiful HTML components
In this example, we'll build an app that automatically generates HTML components, evaluates them, and captures user feedback. We'll use the feedback and evaluations to build up a dataset that we'll use as a basis for further improvements.
The generator
We'll start by using a very simple prompt to generate HTML components with gpt-3.5-turbo.
First, we'll initialize an openai client and wrap it with Braintrust's helper. This is a no-op until we start using the client within code that is instrumented by Braintrust.
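Here's a minimal sketch of that setup, assuming the openai and braintrust packages are installed and OPENAI_API_KEY is set in the environment:

```typescript
import { OpenAI } from "openai";
import { wrapOpenAI } from "braintrust";

// Wrapping is a no-op until the client is used inside Braintrust-instrumented code.
const client = wrapOpenAI(new OpenAI());
```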
This code generates a basic prompt:
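Something along these lines, where the exact system prompt wording is illustrative:

```typescript
// Build a minimal chat prompt for a component request.
function buildMessages(input: string) {
  return [
    {
      role: "system" as const,
      content:
        "You are a skilled design engineer. Generate an HTML component for the user's request.",
    },
    { role: "user" as const, content: input },
  ];
}
```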
Now, let's run this using gpt-3.5-turbo. We'll also do a few things that help us log & evaluate this function later:
- Wrap the execution in a traced call, which will enable Braintrust to log the inputs and outputs of the function when we run it in production or in evals
- Make its signature accept a single input value, which Braintrust's Eval function expects
- Use a seed so that this test is reproducible
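Putting those pieces together (building on the client and prompt helper above; the seed value is arbitrary), a sketch might look like:

```typescript
import { traced } from "braintrust";

async function generateComponent(input: string) {
  return traced(
    async (span) => {
      const response = await client.chat.completions.create({
        model: "gpt-3.5-turbo",
        messages: buildMessages(input),
        seed: 101, // fixed seed so the run is reproducible
      });
      const output = response.choices[0].message.content ?? "";
      span.log({ input, output });
      return output;
    },
    { name: "generateComponent" }
  );
}
```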
Examples
Let's look at a few examples!
To make this easier to validate, we'll use puppeteer to render the HTML as a screenshot.
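A sketch of the screenshot helper, assuming Puppeteer is installed:

```typescript
import puppeteer from "puppeteer";

// Render an HTML string and save it as a screenshot.
async function renderScreenshot(html: string, path: string) {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.setContent(html);
  await page.screenshot({ path });
  await browser.close();
}
```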
Scoring the results
It looks like in a few of these examples, the model generates a full HTML page instead of the component we asked for. This is something we can evaluate to ensure it doesn't happen!
Now, let's update our function to compute this score. Let's also keep track of requests and their ids, so that we can provide user feedback. Normally you would store these in a database, but for demo purposes, a global dictionary should suffice.
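Here's one way to sketch it: the `<html>`-tag heuristic and the in-memory store are illustrative, and we key each request by its span id so feedback can be attached later.

```typescript
// Score 1 when the output looks like a standalone component, 0 when it's a full page.
function isComponent(output: string): number {
  return output.toLowerCase().includes("<html") ? 0 : 1;
}

// Demo-only stand-in for a database, keyed by span id.
const requests: Record<string, { input: string; output: string }> = {};

async function generateComponent(input: string) {
  return traced(
    async (span) => {
      const response = await client.chat.completions.create({
        model: "gpt-3.5-turbo",
        messages: buildMessages(input),
        seed: 101,
      });
      const output = response.choices[0].message.content ?? "";
      requests[span.id] = { input, output };
      span.log({ input, output, scores: { isComponent: isComponent(output) } });
      return output;
    },
    { name: "generateComponent" }
  );
}
```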
Logging results
To enable logging to Braintrust, we just need to initialize a logger. By default, a logger is automatically marked as the current, global logger, and once initialized will be picked up by traced.
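A minimal sketch, assuming the project is called "HTML components" and BRAINTRUST_API_KEY is set in the environment:

```typescript
import { initLogger } from "braintrust";

const logger = initLogger({ projectName: "HTML components" });
```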
Now, we'll run the generateComponent function on a few examples, and see what the results look like in Braintrust.
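For example, using a few illustrative prompts:

```typescript
const examples = ["a login form", "a logs viewer", "a profile page"];

for (const example of examples) {
  const html = await generateComponent(example);
  console.log(example, "=>", html.slice(0, 80), "...");
}
```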
Viewing the logs in Braintrust
Once this runs, you should be able to see the raw inputs and outputs, along with their scores in the project.
Capturing user feedback
Let's also track user ratings for these components. Separate from whether or not they're formatted as HTML, it'll be useful to track whether users like the design.
To do this, configure a new score in the project. Let's call it "User preference" and make it a 👍/👎.
Once you create a human review score, you can evaluate results directly in the Braintrust UI, or capture end-user feedback. Here, we'll pretend to capture end-user feedback. Personally, I liked the login form and logs viewer, but not the profile page. Let's record feedback accordingly.
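A sketch of how that feedback could be recorded with the logger, reusing the in-memory request store from above (the mapping simply mirrors the preferences just described):

```typescript
// 👍 = 1, 👎 = 0, keyed by the original prompt.
const preference: Record<string, number> = {
  "a login form": 1,
  "a logs viewer": 1,
  "a profile page": 0,
};

for (const [id, { input }] of Object.entries(requests)) {
  logger.logFeedback({ id, scores: { "User preference": preference[input] ?? 0 } });
}
```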
As users provide feedback, you'll see the updates they make in each log entry.
Creating a dataset
Now that we've collected some interesting examples from users, let's collect them into a dataset, and see if we can improve the isComponent score.
In the Braintrust UI, select the examples, and add them to a new dataset called "Interesting cases".
Once you create the dataset, it should look something like this:
Evaluating
Now that we have a dataset, let's evaluate the isComponent function on it. We'll use the Eval function, which takes a dataset and a function, and evaluates the function on each example in the dataset.
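A sketch of that eval, assuming the project is called "HTML components" and pulling the "Interesting cases" dataset created above:

```typescript
import { Eval, initDataset } from "braintrust";

Eval("HTML components", {
  data: initDataset("HTML components", { dataset: "Interesting cases" }),
  task: generateComponent,
  scores: [
    ({ output }: { output: string }) => ({
      name: "isComponent",
      score: isComponent(output),
    }),
  ],
});
```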
Once the eval runs, you'll see a summary which includes a link to the experiment. As expected, only one of the three outputs contains HTML, so the score is 33.3%. Let's also label user preference for this experiment, so we can track aesthetic taste manually. For simplicity's sake, we'll use the same labeling as before.
Improving the prompt
Next, let's try to tweak the prompt to stop rendering full HTML pages.
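One way to sketch the tweak is to make the instruction explicit in the system message (the exact wording is an assumption):

```typescript
function buildMessages(input: string) {
  return [
    {
      role: "system" as const,
      content:
        "You are a skilled design engineer. Generate an HTML component for the user's request. " +
        "Return only the component markup. Do not wrap it in <html>, <head>, or <body> tags.",
    },
    { role: "user" as const, content: input },
  ];
}
```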
Nice, it looks like it no longer generates an html tag. Let's re-run the Eval (copy/pasted below for convenience).
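Same Eval sketch as before, unchanged:

```typescript
Eval("HTML components", {
  data: initDataset("HTML components", { dataset: "Interesting cases" }),
  task: generateComponent,
  scores: [
    ({ output }: { output: string }) => ({
      name: "isComponent",
      score: isComponent(output),
    }),
  ],
});
```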
Nice! We are now generating components without the <html> tag.
Where to go from here
Now that we've run another experiment, a good next step would be to rate the new components and make sure we did not suffer a serious aesthetic regression. You can also collect more user examples, add them to the dataset, and re-evaluate to better assess how well your application works. Happy evaluating!