
Chat Agent Architecture

A chat agent is not a chatbot. It is a stateful program that streams responses, calls tools, and manages multi-turn conversations. Here is how to build one.

Most "chatbots" are decision trees wearing a text box as a mask. You click a button, follow a script, and land on a canned response. Agents are fundamentally different. They reason over your input, decide which tools to call, and generate responses that never existed before.

If your chat experience feels like a phone tree, you built a chatbot. If it feels like talking to someone who can actually do things, you built an agent.

Chatbots vs. Agents

Traditional chatbots use pattern matching. They scan user input for keywords, match against a predefined list, and return a scripted reply. Dialogflow, early Intercom bots, and most e-commerce "live chat" widgets work this way. They handle maybe 40 to 60 percent of queries before hitting a dead end and escalating to a human[1].

Agents use LLM-driven reasoning. Instead of matching patterns, the model reads the full conversation, decides what to do next, and generates a response token by token. When it needs information it does not have, it calls a tool. When the tool returns data, the model incorporates it and keeps going.

The difference is not cosmetic. A chatbot cannot handle "Show me running shoes under $120 that come in wide, then book a fitting appointment for Saturday." That requires searching a product catalog, filtering results, understanding a date reference, checking calendar availability, and creating a booking. An agent handles this in a single turn because it can chain tool calls together.

The Streaming Response Pattern

LLMs take time to generate. A typical response runs 3 to 8 seconds end to end[2]. Without streaming, the user stares at a blank screen for the full duration. With streaming, tokens arrive as they are generated, and the first word appears in under 300 milliseconds.

Server-Sent Events (SSE) is the standard transport. The client opens a single HTTP connection. The server pushes chunks of text, tool call deltas, and metadata as they become available. The connection stays open until the response is complete. SSE is simpler than WebSockets, works through CDNs, and handles the unidirectional flow of a chat response naturally.

WebSockets make sense when you need bidirectional communication, like voice or real-time collaboration. For text chat, SSE is the better fit.

The Vercel AI SDK abstracts this cleanly. `streamText()` takes a model, system prompt, messages, and tools, then returns a streaming response you can pipe directly to the client[3]:

```typescript
const result = streamText({
  model: anthropic("claude-sonnet-4-20250514"),
  system: systemPrompt,
  messages,
  tools,
});

return result.toDataStreamResponse();
```

The client receives tokens in real time. No polling. No loading spinners. The conversation feels alive.

Tool Use (Function Calling)

Tools are what separate an agent from a text generator. The LLM does not just produce words. It decides, mid-response, that it needs to take an action. It emits a structured tool call, the server executes it, and the result feeds back into the model for the next generation step.

A tool definition has four parts:

  • name: a unique identifier the model references (e.g., `searchProducts`)
  • description: plain English explaining what the tool does and when to use it
  • schema: a Zod or JSON Schema defining the parameters the model must provide
  • execute: the function that runs when the model calls the tool

Here is a concrete example:

```typescript
const searchProducts = tool({
  description: "Search the product catalog by query, category, or price range",
  parameters: z.object({
    query: z.string(),
    maxPrice: z.number().optional(),
    category: z.string().optional(),
  }),
  execute: async ({ query, maxPrice, category }) => {
    return await db.products.search({ query, maxPrice, category });
  },
});
```

The model sees the tool name and description in its context. When a user asks "Show me running shoes under $120," the model generates a tool call with `{ query: "running shoes", maxPrice: 120 }`. The server executes the search, returns the results, and the model weaves them into a natural response.

Multi-step tool calls happen when one action depends on another. The model searches for products, then calls `bookMeeting` with the selected product and a time slot. Each step completes before the next begins. The model orchestrates the sequence on its own.

Parallel tool calls happen when actions are independent. The model might search products and check calendar availability at the same time. Anthropic and OpenAI models both support this, and the AI SDK handles the concurrent execution automatically[3].
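On the server side, independent tool calls reduce to concurrent promises. Here is a minimal sketch with stubbed-out tool executors (`searchProducts` and `checkAvailability` are illustrative, not part of any SDK):

```typescript
// Two independent tool executors (stubs for illustration).
async function searchProducts(query: string): Promise<string[]> {
  return [`${query} result A`, `${query} result B`];
}

async function checkAvailability(day: string): Promise<string[]> {
  return [`${day} 10:00`, `${day} 14:00`];
}

// When the model emits two independent tool calls in one step,
// the server can execute them concurrently instead of serially.
async function runParallelToolCalls(query: string, day: string) {
  const [products, slots] = await Promise.all([
    searchProducts(query),
    checkAvailability(day),
  ]);
  return { products, slots };
}
```

The total latency of the step is the slowest tool, not the sum of all tools.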

Conversation Design

The system prompt is your primary control layer. It defines the agent's persona, knowledge boundaries, response format, and behavioral rules. A well-written system prompt is the difference between an agent that stays on task and one that hallucinates product features or agrees to things it should not.

System prompts should be specific. "You are a helpful assistant" produces generic behavior. Compare that to: "You are a product specialist for an athletic footwear brand. You recommend products from the current catalog only. You never discuss competitor products. When asked about availability, call the checkInventory tool." That level of specificity produces an agent that actually works.

Context window management matters because LLMs have finite context. Claude supports up to 200K tokens[4]. GPT-4o supports 128K[5]. Long conversations eventually exceed these limits, and performance degrades well before you hit the ceiling.

The practical solution is a sliding window. Keep the first few messages (they establish context and user intent) and the most recent messages (they contain the active thread). Drop everything in the middle. A common configuration: keep the first 4 messages and the last 36. This preserves the opening context while keeping the active conversation intact.
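The sliding window is a few lines of code. A minimal sketch (the `ChatMessage` shape is simplified; real SDK message types carry more fields):

```typescript
type Role = "system" | "user" | "assistant" | "tool";

interface ChatMessage {
  role: Role;
  content: string;
}

// Sliding-window trim: keep the first `head` messages (opening context)
// and the last `tail` messages (active thread); drop the middle.
function trimHistory(
  messages: ChatMessage[],
  head = 4,
  tail = 36
): ChatMessage[] {
  if (messages.length <= head + tail) return messages;
  return [...messages.slice(0, head), ...messages.slice(-tail)];
}
```

Run this before every model call so the history never grows unbounded.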

Message roles structure the conversation:

  • system: instructions the user never sees, loaded once at the start
  • user: the human's input
  • assistant: the model's responses
  • tool: results returned from tool execution

The model uses these roles to understand who said what and what happened. Mixing them up (putting tool results in user messages, for instance) confuses the model and degrades response quality.
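A minimal transcript using these roles might look like the following. The exact tool-call shape varies by SDK; the `toolCall` field here is a simplified illustration:

```typescript
const messages = [
  { role: "system", content: "You are a product specialist for an athletic footwear brand." },
  { role: "user", content: "Show me running shoes under $120." },
  // The assistant's tool call and the tool's result each get their own entry,
  // so the model can see both what it requested and what came back.
  { role: "assistant", content: "", toolCall: { name: "searchProducts", args: { query: "running shoes", maxPrice: 120 } } },
  { role: "tool", content: JSON.stringify([{ name: "Trail Runner", price: 99 }]) },
  { role: "assistant", content: "The Trail Runner is $99 and fits your budget." },
];
```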

Build vs. Buy

Three paths exist. The right one depends on how central the agent is to your product.

Build custom when the agent is your product or deeply integrated with proprietary systems. You control every layer: prompts, tools, memory, streaming, and the UI. The cost is engineering time. Plan for 4 to 8 weeks to reach production quality, longer if you need voice or multi-agent coordination.

Buy a platform (Intercom Fin, Ada, Kustomer) when you need a support agent fast and your use case is standard. You get pre-built integrations, analytics, and handoff to human agents. The tradeoff is limited customization. You cannot control the prompt architecture, tool execution model, or response streaming behavior.

The middle ground is increasingly popular: use the Vercel AI SDK, LangChain, or the Anthropic/OpenAI SDKs directly, pair them with your own prompts and tools, and build the UI on top. You get full control over the agent's behavior without building the streaming infrastructure or tool execution runtime from scratch. This is the path most engineering teams should take.

Production Concerns

Shipping an agent to production introduces problems that do not exist in a prototype.

Rate limiting prevents abuse and controls cost. Implement per-user and per-organization limits. A reasonable starting point: 50 messages per user per hour, 200 per organization per minute. Use token bucket or sliding window algorithms. Always return clear error messages when limits are hit.
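A token bucket is a few dozen lines. This in-memory sketch illustrates the idea; a production deployment would back the buckets with Redis or similar so limits survive restarts and apply across instances:

```typescript
// Minimal token-bucket rate limiter, one bucket per user.
class TokenBucket {
  private tokens: number;
  private lastRefill: number;

  constructor(
    private capacity: number,       // max burst size
    private refillPerSecond: number // sustained rate, e.g. 50/3600 for 50/hour
  ) {
    this.tokens = capacity;
    this.lastRefill = Date.now();
  }

  // Returns true if the request is allowed, false if rate-limited.
  tryConsume(now = Date.now()): boolean {
    const elapsed = (now - this.lastRefill) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsed * this.refillPerSecond);
    this.lastRefill = now;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}
```

When `tryConsume` returns false, respond with HTTP 429 and a message telling the user when to retry.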

Prompt caching reduces latency and cost. Anthropic supports ephemeral caching on system prompts and tool definitions[6]. If your system prompt is 4,000 tokens and you send 100 requests per minute, caching saves you 400,000 input tokens per minute. At Anthropic's pricing, that is real money.
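Anthropic's Messages API exposes this via a `cache_control` marker on a system content block. A sketch (field names follow the Anthropic SDK; check the current docs for minimum cacheable length and cache TTL):

```typescript
// Hypothetical system prompt; in practice this is your full 4,000-token prompt.
const systemPrompt = "You are a product specialist for an athletic footwear brand.";

// System prompt as a content block with an ephemeral cache marker.
// Subsequent requests read the cached prefix at a reduced input rate.
const system = [
  {
    type: "text",
    text: systemPrompt,
    cache_control: { type: "ephemeral" },
  },
];
```

Pass `system` in the request body in place of a plain string; tool definitions can be marked the same way.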

Error handling must cover three failure modes:

  • Model API errors (rate limits, timeouts, server errors): retry with exponential backoff
  • Tool execution failures (database down, external API timeout): return a graceful error message to the model so it can inform the user
  • Malformed responses (truncated JSON, incomplete tool calls): validate and retry the generation
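For the retry path, exponential backoff with jitter is the standard approach. A sketch with illustrative base and cap values:

```typescript
// Exponential backoff with jitter: delay doubles per attempt, capped at maxMs,
// then randomized into [cap/2, cap) so retries from many clients spread out.
function backoffDelayMs(attempt: number, baseMs = 500, maxMs = 30_000): number {
  const exp = Math.min(maxMs, baseMs * 2 ** attempt);
  return exp / 2 + Math.random() * (exp / 2);
}

// Retry wrapper: retries only errors the caller marks as retryable
// (429s, timeouts, 5xx), and rethrows everything else immediately.
async function withRetries<T>(
  fn: () => Promise<T>,
  isRetryable: (err: unknown) => boolean,
  maxAttempts = 3
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= maxAttempts - 1 || !isRetryable(err)) throw err;
      await new Promise((resolve) => setTimeout(resolve, backoffDelayMs(attempt)));
    }
  }
}
```

Wrap the model call, not the tool calls; tool failures should flow back to the model as results it can explain.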

Cost is predictable once you measure it. A typical customer service conversation runs 2,000 to 5,000 tokens total (input plus output). At current Anthropic pricing, that is $0.02 to $0.08 per conversation[7]. Prompt caching can cut input costs by up to 90%.
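The arithmetic is simple enough to keep in your metrics pipeline. A sketch with illustrative rates ($3 per million input tokens, $15 per million output tokens; substitute your model's current pricing):

```typescript
// Cost of one conversation from its token counts.
// Rates are per million tokens and are assumptions, not quoted pricing.
function conversationCostUSD(
  inputTokens: number,
  outputTokens: number,
  inputPerMTok = 3,
  outputPerMTok = 15
): number {
  return (inputTokens / 1e6) * inputPerMTok + (outputTokens / 1e6) * outputPerMTok;
}
```

Note that prompt caching mostly shrinks the input term, which is why heavily cached system prompts move the per-conversation number so much.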

The key metric to track is cost per conversation, not cost per token. Measure it from day one.

How Site Scanner Helps

Site Scanner audits the content and structure that chat agents depend on. Clean semantic HTML, fast server responses, and well-organized product data all translate directly to better agent conversations. The scan report highlights where your site's content foundation is strong and where gaps will cause agent tool calls to fail or return incomplete results.

See how your site scores.

Run a free scan at point11.ai to check your Chat Agent Architecture and 40+ other metrics.

Scan Your Site