The gap between a working agent and one customers love comes down to milliseconds. A chatbot that responds in three seconds feels sluggish; one that responds in 100 milliseconds feels instant.
The difference is engineering. You can make an agent feel dramatically faster without changing the model, just by optimizing retrieval, caching, and deployment. Below, we rank six optimizations by latency impact.
1. Semantic Caching · ~6,400ms saved (65x on cache hit)
When customers ask similar questions, the agent reuses a previous answer instead of computing it from scratch every time. Research suggests[1] that roughly 31% of LLM queries are semantically similar to an earlier one. Semantic caching embeds each incoming query and compares it against cached query–response pairs; when a match clears the similarity threshold, the cached response comes back without hitting retrieval or the LLM. Example: a retail agent fielding "what's your return policy?" and "how do I return something?" can serve the same cached answer instantly instead of running the full pipeline twice.
2. Prompt Caching · 80–85% reduction on LLM call
Every time an agent responds, it re-reads the same set of background instructions; prompt caching lets it skip that repeated work. Most agents send the same system prompt, tool definitions, and knowledge context on every call. With Anthropic's prompt caching[3], that stable prefix is processed once and reused: cache reads cost $0.30/M tokens versus $3.00/M for uncached input tokens, with up to 85% latency reduction. OpenAI takes an automatic approach[4], delivering up to 50% cost reduction and 80% latency reduction with no developer effort.
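As a sketch of the explicit approach, here is how a request payload marks its stable prefix for caching, assuming Anthropic's Messages API `cache_control` format; the system prompt and model name are placeholders.

```python
SYSTEM_PROMPT = "You are a support agent for Acme Retail. ..."  # hypothetical long prefix

def build_request(user_message: str) -> dict:
    # Everything up to and including the block tagged with cache_control
    # (system prompt, tool definitions) is processed once, then reused
    # on subsequent calls at the discounted cache-read rate.
    return {
        "model": "claude-sonnet-4-5",  # any cache-capable model
        "max_tokens": 1024,
        "system": [
            {
                "type": "text",
                "text": SYSTEM_PROMPT,
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "messages": [{"role": "user", "content": user_message}],
    }
```

Only the short, changing part of the request, the user message, is processed fresh on each call.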
3. Edge Caching & Deployment · ~100–500ms saved
Some answers don't change and can be stored closer to the customer so they load almost instantly. Example: a travel agent answering "what's included in the premium cabin?" can serve that answer from edge-cached content in under 15ms instead of running a full retrieval pipeline. Vercel Edge Config[5] gives sub-15ms reads at P99, Cloudflare Workers KV[6] offers single-digit millisecond latency, and edge deployment eliminates network round trips that add ~70ms before computation even starts. Vercel's Fluid Compute[7] cuts cold starts by up to 60% through bytecode caching[8] and predictive warming, while Cloudflare Workers AI[9] runs models across GPUs in 180+ cities.
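The pattern underneath all of these products is a read-through cache close to the user. This toy `EdgeCache` is a generic stand-in for an edge KV store, not any vendor's API: serve from the cache when fresh, fall back to the slow origin path otherwise.

```python
import time
from typing import Callable

class EdgeCache:
    """Toy read-through cache standing in for an edge KV store."""

    def __init__(self, ttl_seconds: float = 300.0):
        self.ttl = ttl_seconds
        self.store: dict[str, tuple[float, str]] = {}  # key -> (stored_at, value)

    def get(self, key: str, fetch: Callable[[str], str]) -> str:
        now = time.monotonic()
        hit = self.store.get(key)
        if hit and now - hit[0] < self.ttl:
            return hit[1]       # fast path: served from the "edge"
        value = fetch(key)      # slow path: full retrieval pipeline at origin
        self.store[key] = (now, value)
        return value
```

The first "what's included in the premium cabin?" pays the full pipeline cost; every request after that, until the TTL expires, skips it.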
4. Reranking · +20–50ms cost · Accuracy lever
After the agent pulls a list of candidate answers, reranking is a second pass that picks the best ones. This is the one optimization that adds latency, and it earns its spot anyway. Example: a customer asks "can I use my warranty if I bought through a reseller?" Initial retrieval pulls 50 chunks mentioning warranties, resellers, and purchase policies. Without reranking, the agent might cite a generic warranty page. With reranking, the cross-encoder identifies the specific clause about third-party purchases and surfaces it first. The accuracy improvement is often the difference between a correct answer and a hallucination.
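The two-stage shape can be sketched as follows. The `keyword_overlap` scorer is a deliberately crude stand-in for a real cross-encoder, which would score each (query, chunk) pair jointly with a model.

```python
def keyword_overlap(query: str, doc: str) -> float:
    # Stand-in relevance scorer; a production reranker would run a
    # cross-encoder model over each (query, document) pair.
    q = set(query.lower().split())
    d = set(doc.lower().split())
    return len(q & d) / len(q) if q else 0.0

def rerank(query: str, candidates: list[str], top_k: int = 3) -> list[str]:
    # Second pass over the retriever's candidate list: score every
    # candidate against the query and keep only the best top_k.
    scored = sorted(candidates, key=lambda c: keyword_overlap(query, c), reverse=True)
    return scored[:top_k]
```

The retriever optimizes for recall (pull 50 plausible chunks fast); the reranker optimizes for precision (spend 20–50ms putting the right chunk first).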
5. Hybrid Search · Marginal latency savings · Accuracy lever
Hybrid search combines two ways of finding information — matching meaning and matching exact words — so the agent doesn't miss results that use specific names or numbers. Example: a customer asks "do you have the Nike Pegasus 41 in size 10?" Pure vector search might return results about running shoes generally, missing the exact product. Hybrid search matches "Pegasus 41" and "size 10" literally while the vector component handles intent. The accuracy gain on product names, model numbers, and domain terminology makes skipping it a false economy.
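One common way to merge the keyword and vector result lists is reciprocal rank fusion, which needs only each document's rank in each list, not score normalization across the two systems. A minimal sketch:

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    # Reciprocal rank fusion: a document's fused score is the sum of
    # 1 / (k + rank) across every ranked list it appears in. Documents
    # that rank well in both lists float to the top.
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)
```

A product page that the keyword side finds via the literal string "Pegasus 41" and the vector side finds via running-shoe intent beats results that only one side surfaced.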
6. Chunk Strategy · Indirect · Foundational
Chunking is how you break up your knowledge base into bite-sized pieces the agent can search through — get it wrong and the agent can't find the right answer no matter how fast everything else is. Example: a financial services agent ingesting 200-page compliance documents as single large chunks will either retrieve too much irrelevant context or miss the specific regulation a customer is asking about. Start with 256–512 token chunks with 50-token overlap. Fixed-size works for homogeneous content, semantic chunking for long-form docs, hierarchical for mixed query types.
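The fixed-size starting point above can be sketched in a few lines. Whitespace-split words stand in for model tokens here; a real pipeline would count with the embedding model's tokenizer.

```python
def chunk(text: str, size: int = 256, overlap: int = 50) -> list[str]:
    # Fixed-size chunking with overlap: each chunk repeats the last
    # `overlap` tokens of the previous one, so a sentence that straddles
    # a boundary is still fully contained in at least one chunk.
    tokens = text.split()
    step = size - overlap
    chunks = []
    for i in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[i:i + size]))
        if i + size >= len(tokens):
            break
    return chunks
```

The overlap is what keeps a regulation clause from being split across two chunks that each retrieve as incomplete fragments.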
Users abandon chatbots that feel slow, and they come back to ones that feel instant. The model is the same one everyone else uses. The difference is everything around it.