
Measuring Agent Performance

If you cannot measure it, you cannot improve it. Agent analytics go beyond conversation counts to resolution rate, revenue attribution, and quality scoring.

Most teams deploy an agent and then watch one number: total conversations. That tells you almost nothing. A busy agent that frustrates users and hallucinates product information is worse than no agent at all. The teams winning with agents treat measurement the same way they treat product analytics: specific metrics, clear thresholds, and a ritual for acting on what they find[1].

The Metrics That Matter

Five metrics give you a complete picture of agent health. Track all five from day one.

Resolution rate measures the percentage of conversations where the agent fully handles the user's request without human escalation. A new agent should hit 60% within the first month. A mature agent operating on well-scoped tasks should reach 80% or higher. If you are below 60%, your agent is generating support tickets, not resolving them.

Quality score is an LLM-graded evaluation of each conversation on a 1 to 5 scale. You feed the conversation transcript to a separate model with a rubric: Was the response accurate? Was it helpful? Did it stay on topic? Did it hallucinate?

A score of 4 or above means the conversation was good; a 3 is borderline and worth spot-checking; below 3 means something went wrong. This is the single most important metric because it catches problems that resolution rate misses, like conversations that technically "resolve" but leave users with incorrect information.
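The grading step is straightforward to sketch. In the snippet below, `callModel` is a hypothetical stand-in for whatever model client you use; the rubric prompt and the score parsing are the parts worth pinning down:

```typescript
// Sketch of LLM-graded quality scoring. `callModel` is a hypothetical
// placeholder for your model client; nothing here assumes a specific API.
const RUBRIC = `You are grading a support-agent conversation on a 1-5 scale.
Consider: Was the response accurate? Was it helpful? Did it stay on topic?
Did it hallucinate? Reply with a single line: "Score: N" where N is 1-5.`;

// Extract the numeric score from the grader's reply; null if unparseable.
function parseQualityScore(graderOutput: string): number | null {
  const match = graderOutput.match(/Score:\s*([1-5])/i);
  return match ? Number(match[1]) : null;
}

async function gradeConversation(
  transcript: string,
  callModel: (system: string, user: string) => Promise<string>,
): Promise<number | null> {
  const reply = await callModel(RUBRIC, transcript);
  return parseQualityScore(reply);
}
```

Keep the grader as a separate model from the agent itself, so the agent is never grading its own work, and treat unparseable replies as a signal to tighten the rubric rather than silently dropping them.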

Revenue attribution tracks agent-influenced conversions. When an agent helps a user find a product, answers a pre-purchase question, or walks someone through checkout, and that user converts within the session, the agent gets attribution. This is the metric that justifies your investment to leadership.
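One minimal way to compute same-session attribution, assuming illustrative event shapes (the field names are not a fixed schema):

```typescript
// Sketch of same-session attribution: a conversion is credited to the
// agent only if the same session had an agent conversation before the
// purchase. Event shapes here are illustrative, not prescriptive.
interface SessionEvent {
  sessionId: string;
  type: "agent_conversation" | "conversion";
  timestampMs: number;
  revenue?: number; // present on conversions
}

// Sum revenue from conversions preceded by an agent touch in the session.
function agentAttributedRevenue(events: SessionEvent[]): number {
  const firstAgentTouch = new Map<string, number>();
  for (const e of events) {
    if (e.type === "agent_conversation") {
      const prev = firstAgentTouch.get(e.sessionId);
      if (prev === undefined || e.timestampMs < prev) {
        firstAgentTouch.set(e.sessionId, e.timestampMs);
      }
    }
  }
  let total = 0;
  for (const e of events) {
    if (e.type !== "conversion") continue;
    const touch = firstAgentTouch.get(e.sessionId);
    if (touch !== undefined && touch <= e.timestampMs) total += e.revenue ?? 0;
  }
  return total;
}
```

The ordering check matters: a conversion that happens before the user ever talks to the agent should not be credited to it.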

User satisfaction should be binary: thumbs up or thumbs down. Do not use a 5-point scale. Users will not spend time on nuanced ratings in a chat interface. Binary feedback at a 15 to 20% response rate gives you a reliable signal.

Escalation rate is the complement of resolution rate, but track it separately because the reasons for escalation matter more than the number. Tag every escalation with a category: knowledge gap, tool failure, user frustration, policy limitation. The category distribution tells you exactly where to invest next.

Instrumentation

Good measurement starts with good instrumentation. You need event tracking at four key moments in every conversation.

First, conversation start: capture the entry point, user context (authenticated or anonymous, device, referrer), and the first message. This tells you why users are reaching out.

Second, tool invocations: every time the agent calls a tool (product search, order lookup, navigation), log the tool name, input parameters, output, and latency. Tool failures are the most common cause of bad conversations.

Third, conversation end: capture the resolution status, total messages, duration, and whether the user escalated. This is your outcome signal.

Fourth, post-conversation events: did the user convert? Did they come back? Did they contact support about the same issue? These downstream signals separate agents that resolve issues from agents that just end conversations.

Define typed event schemas for each of these moments. When your schema is loose, your data is dirty, and dirty data makes every metric unreliable. A typed schema like `ConversationStarted`, `ToolInvoked`, `ConversationEnded`, and `ConversationOutcome` enforced at the code level prevents the drift that makes analytics useless six months in.
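A sketch of what code-level enforcement can look like in TypeScript; the field names are illustrative, but the discriminated union on `type` is the point, because a typo in a field name then fails the build instead of silently dirtying your data:

```typescript
// One possible shape for the four event schemas. Field names are
// illustrative; the discriminated union is what enforces the schema.
type AgentEvent =
  | { type: "ConversationStarted"; conversationId: string; entryPoint: string;
      authenticated: boolean; device: string; referrer: string | null }
  | { type: "ToolInvoked"; conversationId: string; tool: string;
      input: unknown; output: unknown; latencyMs: number; failed: boolean }
  | { type: "ConversationEnded"; conversationId: string; resolved: boolean;
      escalated: boolean; messageCount: number; durationMs: number }
  | { type: "ConversationOutcome"; conversationId: string; converted: boolean;
      returned: boolean; repeatContact: boolean };

// The compiler narrows each case, so every handler sees exactly the
// fields that event actually carries.
function describe(e: AgentEvent): string {
  switch (e.type) {
    case "ConversationStarted": return `started via ${e.entryPoint}`;
    case "ToolInvoked": return `${e.tool} in ${e.latencyMs}ms`;
    case "ConversationEnded": return e.resolved ? "resolved" : "unresolved";
    case "ConversationOutcome": return e.converted ? "converted" : "no conversion";
  }
}
```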

Store conversation metadata alongside events: the model version, prompt version, temperature setting, and any A/B test variant. You will need this for debugging and for comparing configurations later.

Building Dashboards

Build three dashboards, each serving a different audience.

The Core Operations dashboard is for the team running the agent day to day. It shows daily conversation volume, resolution rate trend over 30 days, average quality score, and the top 10 questions users are asking. This dashboard should update hourly. When resolution rate drops 5 points in a day, you want to know before users start complaining.

The Revenue dashboard is for leadership. It shows agent-influenced conversions, revenue attributed to agent interactions, cost per conversation, and ROI compared to human support for the same query types. Update this daily. Keep it simple: three to four charts, no clutter.

The Quality dashboard is for the team tuning the agent. It shows quality score distribution (what percentage of conversations score 1, 2, 3, 4, 5), the 10 worst conversations from the past week, failure mode breakdown, and before/after comparisons when you ship prompt or model changes. This dashboard needs drill-down capability. A chart showing average quality is useless without the ability to click into the worst conversations and read the transcripts.

For tooling, you do not need a dedicated analytics platform on day one. Postgres for conversation storage, Vercel Analytics for traffic patterns, and Sentry for error tracking will take you through your first 10,000 conversations. Graduate to a dedicated pipeline when query performance on raw conversation data starts to degrade.

Error Monitoring

Agents fail differently from traditional software. A 500 error is obvious. An agent that confidently gives wrong information returns a 200 and looks fine in your uptime dashboard. You need monitoring that catches both types.

Sentry integration is the foundation[3]. Instrument every tool call, every external API request, and every conversation that ends in escalation. Capture the full conversation context in Sentry breadcrumbs so you can reconstruct what happened without searching through logs.

Alert on spikes: if your tool failure rate exceeds 5% within a 15-minute window, that is an incident. A broken product search tool means every conversation that needs product information will fail, and users will not wait around for you to notice.
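The 5%-in-15-minutes rule is simple to implement in-process before you wire it into an alerting system. A minimal sketch, with the window and threshold taken from the rule above and everything else (including the minimum-sample guard) an illustrative choice:

```typescript
// Windowed tool-failure alarm: keep recent results, check the rate on
// each call. Thresholds mirror the 5% / 15-minute rule; the minEvents
// guard avoids alerting on tiny samples and is an assumption, not a rule.
class ToolFailureMonitor {
  private events: { atMs: number; failed: boolean }[] = [];
  constructor(
    private windowMs = 15 * 60 * 1000,
    private threshold = 0.05,
    private minEvents = 20,
  ) {}

  // Record one tool call; returns true if the windowed rate breaches.
  record(failed: boolean, nowMs: number): boolean {
    this.events.push({ atMs: nowMs, failed });
    this.events = this.events.filter((e) => nowMs - e.atMs <= this.windowMs);
    if (this.events.length < this.minEvents) return false;
    const failures = this.events.filter((e) => e.failed).length;
    return failures / this.events.length > this.threshold;
  }
}
```

In production you would page on the breach rather than return a boolean, but the windowing logic is the same.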

Conversation-level error tracking means tagging each conversation with an error flag and category. "Tool timeout," "hallucinated product," "contradicted policy," "loop detected" are all distinct failure modes that require different fixes. Aggregate these daily and watch for trends.

The most common failures you will see: tool timeouts when upstream APIs slow down, hallucinated information when the agent lacks knowledge and fills the gap creatively, context window overflow on long conversations, and rate limiting from your model provider during traffic spikes. Build runbooks for each.

A/B Testing Agent Configurations

Small changes to an agent's configuration can produce large differences in performance. A prompt tweak that improves resolution rate by 5% compounds across thousands of conversations. But you need rigorous testing to know what actually works versus what just feels better when you read a few transcripts.

What to test: prompt variants (tone, instruction ordering, few-shot examples), model selection (Claude vs GPT-4 vs smaller models for specific tasks), temperature settings (lower for factual queries, higher for creative assistance), and tool configurations (search result count, context window allocation).

Split by session, not by message. A user must experience one variant for their entire conversation. Switching mid-conversation produces meaningless data and a confusing user experience.
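Session-level assignment does not require an assignment table: hashing the session ID gives every user a stable variant for their whole conversation. FNV-1a below is an arbitrary but stable choice of hash:

```typescript
// Deterministic variant assignment by session: the same session ID
// always maps to the same variant, so no mid-conversation switching.
// FNV-1a is an illustrative hash choice; any stable hash works.
function assignVariant(sessionId: string, variants: string[]): string {
  let hash = 2166136261; // FNV offset basis
  for (let i = 0; i < sessionId.length; i++) {
    hash ^= sessionId.charCodeAt(i);
    hash = Math.imul(hash, 16777619); // FNV prime
  }
  return variants[(hash >>> 0) % variants.length];
}
```

Because assignment is a pure function of the session ID, you can recompute it at analysis time instead of storing it, though logging the variant with each conversation (as suggested above for A/B metadata) is still the safer habit.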

Sample size matters. You need a minimum of 500 conversations per variant to detect a meaningful difference in resolution rate[2]. For quality score, you need closer to 1,000 because the variance is higher. Do not call a test after 50 conversations, no matter how dramatic the difference looks. Small samples lie.

Run one test at a time. Multivariate testing with agents is tempting, but the interaction effects between prompt changes and model changes make results nearly impossible to interpret. Change one variable, measure the impact, then move on.

The Review Ritual

Dashboards show trends. Transcripts show truth. You need both, and you need a regular cadence for reviewing them.

Weekly: Read the 10 worst conversations from the past 7 days, ranked by quality score. For each one, identify the root cause (knowledge gap, tool failure, prompt weakness, edge case) and decide whether it needs a fix or is an acceptable limitation. This takes 60 to 90 minutes and is the highest-leverage activity for improving agent quality.
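Pulling the weekly review set is a filter-and-sort, sketched here with illustrative record shapes:

```typescript
// The weekly review pull: lowest-scoring conversations from the past
// 7 days. Record shape is illustrative; the ranking logic is the point.
interface ScoredConversation {
  id: string;
  qualityScore: number; // 1-5, from the LLM grader
  endedAtMs: number;
}

function worstConversations(
  all: ScoredConversation[],
  nowMs: number,
  limit = 10,
): ScoredConversation[] {
  const weekMs = 7 * 24 * 60 * 60 * 1000;
  return all
    .filter((c) => nowMs - c.endedAtMs <= weekMs)
    .sort((a, b) => a.qualityScore - b.qualityScore)
    .slice(0, limit);
}
```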

Monthly: Compare this month's metrics against last month's. Look at resolution rate, quality score distribution, and revenue attribution. If you shipped agent changes during the month, isolate their impact. Write a one-paragraph summary of what improved, what regressed, and what to focus on next month.

Quarterly: Recalibrate your quality scoring rubric. As your agent improves, a conversation that scored a 4 six months ago might only deserve a 3 today because your standards should rise with your agent's capabilities. Re-score a random sample of 50 older conversations with the updated rubric and adjust your historical baselines.

The teams that build great agents are not the ones with the best models or the most sophisticated prompts. They are the ones that measure relentlessly, review honestly, and improve incrementally. Measurement is the foundation everything else is built on.

How Site Scanner Helps

Site Scanner gives you a baseline measurement of your site's readiness for agent interactions. The Performance, Discoverability, and Accessibility dimensions map directly to agent success metrics: faster pages mean lower tool latency, better structure means higher resolution rates, and accessible content means agents can serve every visitor.

See how your site scores.

Run a free scan at point11.ai to see how your site scores on agent performance and 40+ other metrics.

Scan Your Site