Evaluating a web agent

Web navigation is tricky for AI agents: to take the right action, they need to understand webpage layouts and visual elements, and keep track of previous steps. This cookbook focuses on how models decide what to do next, like clicking buttons, entering text, or choosing dropdown options.

We'll use the Multimodal-Mind2Web dataset, which combines screenshots and HTML, to help models make better decisions. We'll also discuss how to apply these lessons beyond just this dataset. By the end, you'll have a clear framework for testing how well your AI navigates websites and finding ways to improve it.

Getting started

To follow along, start by installing the required packages:

%pip install lxml openai datasets pillow braintrust autoevals

Next, make sure you have a Braintrust account, along with an OpenAI API key. To authenticate with Braintrust, export your BRAINTRUST_API_KEY as an environment variable:

export BRAINTRUST_API_KEY="YOUR_API_KEY_HERE"

Exporting your API key is a best practice, but to make it easier to follow along with this cookbook, you can also hardcode it into the code below.

We'll import our modules and initialize the OpenAI client using the Braintrust proxy:

import os
import json
import base64
import re
import time
from typing import Dict, Any, List, Optional, Tuple
 
from lxml import etree
import openai
from datasets import load_dataset
from PIL import Image
from io import BytesIO
 
from braintrust import (
    Eval,
    Attachment,
    start_span,
    wrap_openai,
)
 
# Constants
MAX_SAMPLES = 50
HTML_MAX_ELEMENTS = 50
MAX_PREVIOUS_ACTIONS = 3
 
# Uncomment the following line to hardcode your API key
# os.environ["BRAINTRUST_API_KEY"] = "YOUR_API_KEY_HERE"
 
client = wrap_openai(
    openai.OpenAI(
        api_key=os.environ.get("BRAINTRUST_API_KEY"),
        base_url="https://api.braintrust.dev/v1/proxy",
    )
)

Approaches to web navigation

There are a few ways AI models can navigate websites:

  • HTML-only: Uses page structure but misses visual details.
  • Screenshot-only: Captures visuals but misses interaction details.
  • Multimodal: Combines HTML structure and screenshots for better decisions.

In this cookbook, we'll use the multimodal approach, combining HTML DOM structure and screenshots.
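
To make this concrete, here's a minimal sketch of the kind of chat message we'll assemble later in the prediction function: the HTML summary travels as text, and the screenshot as a base64-encoded image. The values below are placeholders, not real page data.

# Placeholder values for illustration only
html_summary = '<button id="search-btn">Search flights</button>'
base64_image = base64.b64encode(b"...raw PNG bytes...").decode("utf-8")

multimodal_message = {
    "role": "user",
    "content": [
        {
            "type": "text",
            "text": f"Task: Book a flight\n\nKey webpage elements:\n{html_summary}",
        },
        {
            "type": "image_url",
            "image_url": {"url": f"data:image/png;base64,{base64_image}"},
        },
    ],
}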

Processing screenshots

First, let's write a function that converts a webpage screenshot into a format we can pass to our model and attach to our eval.

def process_screenshot(screenshot_input: Any) -> Optional[Attachment]:
    with start_span(name="process_screenshot") as span:
        try:
            # Handle PIL Image
            if isinstance(screenshot_input, Image.Image):
                img_byte_arr = BytesIO()
                screenshot_input.save(img_byte_arr, format="PNG")
                image_data = img_byte_arr.getvalue()
 
            # Handle file path
            elif isinstance(screenshot_input, str) and os.path.exists(screenshot_input):
                with open(screenshot_input, "rb") as f:
                    image_data = f.read()
 
            # Handle bytes
            elif isinstance(screenshot_input, bytes):
                image_data = screenshot_input
 
            # Handle dictionary with base64 data
            elif isinstance(screenshot_input, dict) and "data" in screenshot_input:
                data = screenshot_input["data"]
                if not isinstance(data, str):
                    return None
 
                # Process base64 data
                if data.startswith("data:image"):
                    base64_data = data.split(",", 1)[1]
                elif data.startswith("/9j/") or data.startswith("iVBOR"):
                    base64_data = data
                else:
                    return None
 
                image_data = base64.b64decode(base64_data)
            else:
                return None
 
            # Create attachment
            result = Attachment(
                data=image_data,
                filename="screenshot.png",
                content_type="image/png",
            )
 
            return result
 
        except Exception:
            return None
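
As a quick sanity check (illustrative only, not part of the eval), you can pass the helper a PIL image and confirm it comes back as an Attachment. Outside of an eval, the Braintrust span it opens should simply be a no-op:

test_image = Image.new("RGB", (64, 64), color="white")
attachment = process_screenshot(test_image)
print(attachment is not None)  # True if the conversion succeeded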

Next, we'll identify and summarize important HTML elements on the webpage, making it easier for the model to quickly understand page structure:

def get_enhanced_tree_summary(
    html_content: str, max_items: int = HTML_MAX_ELEMENTS
) -> str:
    with start_span(name="html_parsing") as span:
        if not html_content:
            return "No HTML content provided"
 
        try:
            # Parse HTML
            parser = etree.HTMLParser()
            dom_tree = etree.fromstring(html_content, parser)
 
            # XPath for interactive elements, sorted by relevance
            xpath_queries = [
                "//button | //input[@type='submit'] | //input[@type='button']",
                "//a[@href] | //*[@role='button'] | //*[@onclick]",
                "//input[not(@type='hidden')] | //select | //textarea",
                "//label | //form",
                "//h1 | //h2 | //h3 | //nav | //*[@role='navigation']",
            ]
 
            # Collect elements by priority until max_items is reached
            important_elements = []
            for query in xpath_queries:
                if len(important_elements) >= max_items:
                    break
                elements = dom_tree.xpath(query)
                remaining_slots = max_items - len(important_elements)
                important_elements.extend(elements[:remaining_slots])
 
            # Create a concise representation
            summary = []
            for elem in important_elements:
                tag = elem.tag
 
                # Get text content, limited to 30 chars
                text = elem.text.strip() if elem.text else ""
                if not text:
                    for child in elem.xpath(".//text()"):
                        if child.strip():
                            text += " " + child.strip()
                text = text.strip()[:30]
 
                # Get key attributes
                key_attrs = [
                    "id",
                    "type",
                    "placeholder",
                    "href",
                    "role",
                    "aria-label",
                    "value",
                    "name",
                ]
                attrs = []
                for k in key_attrs:
                    if k in elem.attrib:
                        attrs.append(f'{k}="{elem.attrib[k]}"')
 
                # Format element representation
                elem_repr = f"<{tag} {' '.join(attrs)}>{text}</{tag}>"
                summary.append(elem_repr)
 
            return "\n".join(summary)
 
        except Exception as e:
            return f"Error parsing HTML: {str(e)}"

Keeping track of actions

Models perform better if they have context from previous steps. Without historical context, an agent might repeat actions or select incorrect next steps.

This function takes the latest few actions (up to MAX_PREVIOUS_ACTIONS) and neatly formats them for easy reference:

def format_previous_actions(
    actions: List[str], max_actions: int = MAX_PREVIOUS_ACTIONS
) -> str:
    if not actions:
        return "None"
 
    # Only take the most recent actions
    recent_actions = actions[-max_actions:]
 
    # Format with numbering
    formatted = "\n".join(
        [f"{i+1}. {action}" for i, action in enumerate(recent_actions)]
    )
 
    # Indicate if there were more actions before these
    if len(actions) > max_actions:
        formatted = (
            f"Showing {max_actions} most recent of {len(actions)} total actions\n"
            + formatted
        )
 
    return formatted
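
For example, with four recorded actions, only the three most recent are shown, along with a note about the truncation (the action strings here are made up):

history = [
    "[link] Flights -> CLICK",
    "[textbox] From -> TYPE: SFO",
    "[textbox] To -> TYPE: JFK",
    "[button] Search -> CLICK",
]
print(format_previous_actions(history))
# Showing 3 most recent of 4 total actions
# 1. [textbox] From -> TYPE: SFO
# 2. [textbox] To -> TYPE: JFK
# 3. [button] Search -> CLICK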

We also need a reliable way to convert the raw action descriptions in our dataset into structured data our program can use. This function parses an action description and extracts the action type (CLICK, TYPE, or SELECT) along with any associated value (like the text to type):

def parse_operation_string(operation_str: str) -> Dict[str, str]:
    with start_span(name="parse_operation") as span:
        # Default values
        operation = {"op": "CLICK", "value": ""}
 
        if not operation_str:
            return operation
 
        try:
            # Try parsing as JSON first
            if operation_str.strip().startswith("{"):
                parsed = json.loads(operation_str)
                if isinstance(parsed, dict):
                    operation["op"] = parsed.get("op", "CLICK")
                    operation["value"] = parsed.get("value", "")
            else:
                # Fallback to regex parsing
                match_op = re.search(r"(CLICK|TYPE|SELECT)", operation_str)
                if match_op:
                    operation["op"] = match_op.group(1)
                    match_value = re.search(
                        r'value\s*[:=]\s*["\']?([^"\']+)["\']?', operation_str
                    )
                    if match_value:
                        operation["value"] = match_value.group(1)
        except Exception:
            pass
 
        return operation
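
Both the JSON and free-text forms parse into the same structure. For example:

print(parse_operation_string('{"op": "TYPE", "value": "New York"}'))
# {'op': 'TYPE', 'value': 'New York'}

print(parse_operation_string("SELECT value='Economy'"))
# {'op': 'SELECT', 'value': 'Economy'}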

Loading and preparing the dataset

Now that we've set up our helper functions, we can load and process samples from the Multimodal-Mind2Web dataset:

def load_mind2web_samples(
    max_samples: int = MAX_SAMPLES, use_smaller_subset: bool = True
) -> List[Dict[str, Any]]:
 
    # Load the dataset with streaming to conserve memory
    split = "test_domain" if use_smaller_subset else "train"
    dataset = load_dataset("osunlp/Multimodal-Mind2Web", split=split, streaming=True)
 
    processed_samples = []
    successful_samples = 0
 
    # Process samples
    for item in dataset:
        if successful_samples >= max_samples:
            break
 
        try:
            with start_span(name="process_sample") as sample_span:
                # Extract basic fields
                annotation_id = item.get(
                    "annotation_id", f"sample_{successful_samples}"
                )
                website = item.get("website", "unknown")
                confirmed_task = item.get("confirmed_task", "Navigate the website")
                cleaned_html = item.get("cleaned_html", "<html></html>")
                operation_str = item.get("operation", '{"op": "CLICK", "value": ""}')
 
                # Process operation
                operation = parse_operation_string(operation_str)
 
                # Process screenshot
                screenshot_attachment = None
                screenshot_dict = item.get("screenshot")
                if screenshot_dict:
                    screenshot_attachment = process_screenshot(screenshot_dict)
 
                # Process HTML summary
                html_summary = get_enhanced_tree_summary(
                    cleaned_html, max_items=HTML_MAX_ELEMENTS
                )
 
                # Process previous actions
                action_reprs = item.get("action_reprs", [])
                previous_actions_str = format_previous_actions(
                    action_reprs, max_actions=MAX_PREVIOUS_ACTIONS
                )
 
                # Map operation type to the correct option letter
                expected_option = "A"  # Default to CLICK
                if operation["op"] == "TYPE":
                    expected_option = "B"
                elif operation["op"] == "SELECT":
                    expected_option = "C"
 
                # Create a focused prompt
                formatted_prompt = f"""
                    Task: {confirmed_task}
 
                    Key webpage elements:
                    {html_summary}
 
                    Previous actions:
                    {previous_actions_str}
 
                    What should be the next action? Select from:
                    A. Click the appropriate element based on the task
                    B. Type text into an input field
                    C. Select an option from a dropdown
                    """
 
                # Build complete sample
                sample = {
                    "annotation_id": annotation_id,
                    "website": website,
                    "confirmed_task": confirmed_task,
                    "html_summary": html_summary,
                    "operation": operation,
                    "previous_actions_str": previous_actions_str,
                    "formatted_prompt": formatted_prompt,
                    "expected_option": expected_option,
                    "expected_action": operation["op"],
                    "expected_value": operation["value"],
                    "screenshot_attachment": screenshot_attachment,
                }
 
                processed_samples.append(sample)
                successful_samples += 1
 
        except Exception:
            continue
 
    return processed_samples
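
Before running the full evaluation, you can spot-check a couple of processed samples. This streams the dataset, so the first call may take a moment:

preview = load_mind2web_samples(max_samples=2)
for sample in preview:
    print(sample["website"], "->", sample["expected_action"], repr(sample["expected_value"]))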

Next, we'll transform these samples into the structure our Braintrust eval expects. This function creates dataset entries that clearly separate inputs (task, prompt, screenshot) from the expected actions they'll be compared against during evaluation:

def create_braintrust_dataset(samples: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
 
    dataset_samples = []
 
    for sample in samples:
        if not isinstance(sample, dict):
            continue
 
        # Extract operation details
        operation = sample.get("operation", {})
        operation_type = (
            operation.get("op", "CLICK") if isinstance(operation, dict) else "CLICK"
        )
        operation_value = (
            operation.get("value", "") if isinstance(operation, dict) else ""
        )
 
        # Create dataset entry
        dataset_entry = {
            "input": {
                "prompt": sample.get("formatted_prompt", ""),
                "task": sample.get("confirmed_task", ""),
                "website": sample.get("website", ""),
                "previous_actions": sample.get("previous_actions_str", "None"),
            },
            "expected": {
                "option": sample.get("expected_option", ""),
                "action": operation_type,
                "value": operation_value,
            },
            "metadata": {
                "annotation_id": sample.get("annotation_id", ""),
                "website": sample.get("website", ""),
                "operation_type": operation_type,
            },
        }
 
        # Add screenshot attachment if available
        if sample.get("screenshot_attachment"):
            dataset_entry["input"]["screenshot"] = sample["screenshot_attachment"]
 
        dataset_samples.append(dataset_entry)
 
    return dataset_samples

Building the prediction function

Next, we'll build the prediction function that will send each formatted input to the model (gpt-4o) and retrieve the predicted action:

def predict_with_gpt4o(input_data: Dict[str, Any]) -> Dict[str, Any]:
    with start_span(name="model_prediction") as predict_span:
        try:
            # Extract input components
            prompt = input_data.get("prompt", "")
            screenshot_attachment = input_data.get("screenshot")
 
            # Create system message requesting JSON output
            system_message = """You are a web navigation assistant that helps users complete tasks online.
                Analyze the webpage and determine the best action to take next based on the task.
 
                You MUST respond with a valid JSON object with the following structure:
                {
                "option": "A, B, or C",
                "op": "CLICK, TYPE, or SELECT",
                "value": "Only provide value for TYPE/SELECT actions"
                }
 
                Option A corresponds to CLICK, B to TYPE, and C to SELECT.
                For CLICK operations, include an empty value field.
 
                Example for clicking:
                {"option": "A", "op": "CLICK", "value": ""}
 
                Example for typing:
                {"option": "B", "op": "TYPE", "value": "search query text"}
 
                Example for selecting:
                {"option": "C", "op": "SELECT", "value": "dropdown option"}
                """
 
            # Create messages array
            messages = [{"role": "system", "content": system_message}]
 
            # Add screenshot if available
            if screenshot_attachment and hasattr(screenshot_attachment, "data"):
                try:
                    image_data = screenshot_attachment.data
                    base64_image = base64.b64encode(image_data).decode("utf-8")
 
                    messages.append(
                        {
                            "role": "user",
                            "content": [
                                {
                                    "type": "image_url",
                                    "image_url": {
                                        "url": f"data:image/png;base64,{base64_image}"
                                    },
                                },
                                {"type": "text", "text": prompt},
                            ],
                        }
                    )
                except Exception:
                    messages.append({"role": "user", "content": prompt})
            else:
                messages.append({"role": "user", "content": prompt})
 
            # Request JSON output format
            response = client.chat.completions.create(
                model="gpt-4o",
                messages=messages,
                max_tokens=150,
                temperature=0.2,
                response_format={"type": "json_object"},  # This is critical!
            )
 
            result = response.choices[0].message.content
 
            # Parse JSON response
            try:
                structured_response = json.loads(result)
 
                # Ensure the required fields exist
                if "option" not in structured_response:
                    structured_response["option"] = ""
                if "op" not in structured_response:
                    structured_response["op"] = ""
                if "value" not in structured_response:
                    structured_response["value"] = ""
 
                return structured_response
 
            except json.JSONDecodeError as e:
                # If JSON parsing fails, try to extract data from text
                option_match = re.search(r"Answer:\s*([ABC])", result, re.IGNORECASE)
                action_match = re.search(
                    r"Action:\s*(CLICK|TYPE|SELECT)", result, re.IGNORECASE
                )
                value_match = re.search(r"Value:\s*(.+?)(?:\n|$)", result)
 
                option = option_match.group(1).upper() if option_match else ""
                action = action_match.group(1).upper() if action_match else ""
                value = value_match.group(1).strip() if value_match else ""
 
                # Convert to structured format
                return {
                    "option": option,
                    "op": action,
                    "value": value,
                    "error": f"JSON parsing failed: {str(e)}",
                }
 
        except Exception as e:
            # Return error information in JSON format
            return {"option": "", "op": "ERROR", "value": str(e), "error": str(e)}

Defining our scorers

To evaluate how accurate the predictions are against the ground truth, we need metrics that can pinpoint specific strengths and weaknesses in our agent. We'll create two simple code-based scorers.

The first scorer checks if the predicted action matches the expected action type:

def option_selection_scorer(output: Dict[str, str], expected: Dict[str, Any]) -> int:
    # The option letter maps one-to-one to the operation (A=CLICK, B=TYPE, C=SELECT),
    # so comparing the predicted "op" to the expected action checks the model's choice
    return int(output["op"] == expected["action"])

The second evaluates whether the details of the action were correct:

def action_correctness_scorer(output: Dict[str, str], expected: Dict[str, Any]) -> int:
    # First, check if both action types match (note output uses "op" key)
    action_matches = output["op"] == expected["action"]
 
    # If the actions don't match, return 0 immediately
    if not action_matches:
        return 0
 
    # If we're dealing with a CLICK action, we've already confirmed they match
    if expected["action"] == "CLICK":
        return 1
 
    # For TYPE or SELECT, check if values match too
    return int(output["value"] == expected["value"])

Using two different scorers will help us identify whether errors come from misunderstanding the task context or from incorrectly formulating the action details.
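
For example (with made-up values), a prediction that picks the right action type but an inexact value scores 1 on the first scorer and 0 on the second:

prediction = {"option": "B", "op": "TYPE", "value": "New York"}
ground_truth = {"option": "B", "action": "TYPE", "value": "new york"}

print(option_selection_scorer(prediction, ground_truth))  # 1: action type matches
print(action_correctness_scorer(prediction, ground_truth))  # 0: value comparison is exact, so case matters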

Running the evaluation

Now that we've set up the task, dataset, and evaluation criteria, we're ready to run our evaluation. This function will load and process each dataset sample, generate predictions, and assess how accurately the model identifies the correct action type and associated details. All results will be captured in Braintrust, allowing us to analyze performance and pinpoint areas for improvement.

def run_mind2web_evaluation(sample_size: int = MAX_SAMPLES) -> None:
    try:
        # Load samples
        samples = load_mind2web_samples(max_samples=sample_size)
 
        if not samples:
            return
 
        # Create Braintrust dataset
        dataset = create_braintrust_dataset(samples)
 
        # Run the evaluation
        experiment_name = f"mind2web-{int(time.time())}"
        Eval(
            "multimodal-mind2web-eval",  # Project name
            data=dataset,
            task=predict_with_gpt4o,
            scores=[option_selection_scorer, action_correctness_scorer],
            experiment_name=experiment_name,
            metadata={
                "model": "gpt-4o",
            },
        )
 
    except Exception as e:
        print(f"Evaluation failed: {e}")
 
 
if __name__ == "__main__":
    # Run evaluation with a smaller sample size for testing. Adjust this number to run on more or fewer samples.
    run_mind2web_evaluation(sample_size=10)

Analyzing the results

Web agents have many configuration options that can impact their performance. In Braintrust, you can dig deeper into each trace to see each step the agent takes, including attachments and intermediate processing steps. This makes it easier to identify issues, debug quickly, and iterate.


Performance can also vary depending on context. For example, your agent might perform well on some websites but struggle with others, or handle certain action types better. In Braintrust, you can group and filter evaluation results by metadata, helping you quickly pinpoint patterns and identify areas for improvement.


Learning from the data

Taking the time to analyze your results in Braintrust will help you discover clear opportunities to improve your agent. For example, you might find that certain HTML preprocessing techniques perform better on form-intensive websites, or that providing more detailed historical context improves accuracy on complex tasks. By tracing each action, filtering results, and comparing different approaches systematically, you can make targeted improvements instead of relying on guesswork.

Next steps

Now that you've explored how to evaluate the decision-making ability of a web agent, you can:
