When you talk to another person, the gap between when they stop speaking and when you start is about 200 milliseconds[1]. That is faster than a blink. Voice agents need to hit that same window to feel natural, which means every component in the pipeline has a hard latency budget and zero room for architectural slop.
The Latency Budget
The voice pipeline has three sequential stages, and each one carries its own slice of a hard end-to-end budget anchored to that 200ms turn-taking gap:
- Speech-to-Text (STT): ~100ms to transcribe what the user said
- LLM Processing: ~200ms to generate the first token of a response
- Text-to-Speech (TTS): ~100ms to begin speaking the response
Total: roughly 400ms end to end. That is tight but achievable with streaming at every stage. The trick is that you do not wait for one stage to finish before starting the next.
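The budget arithmetic is worth making explicit. A minimal sketch, using the article's target numbers (treat them as assumptions, not measurements): with streaming, the perceived latency is the sum of each stage's time-to-first-output, not the time for each stage to fully finish.

```python
# Per-stage targets from the budget above: each number is the time
# to that stage's FIRST output, since later output streams in parallel.
STAGE_BUDGET_MS = {
    "stt": 100,  # first usable partial transcript
    "llm": 200,  # first response token
    "tts": 100,  # first audio chunk
}

def end_to_end_budget_ms(budget: dict[str, int]) -> int:
    """Perceived latency: sum of time-to-first-output across stages."""
    return sum(budget.values())

print(end_to_end_budget_ms(STAGE_BUDGET_MS))  # 400
```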
STT streams partial transcripts to the LLM. The LLM streams tokens to TTS. TTS streams audio to the speaker. Each stage overlaps with the next.
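The overlap pattern can be sketched with lazy generators. All three stage functions here are hypothetical stand-ins: each one pulls from the previous stage as data arrives, so transcription, generation, and synthesis all run concurrently rather than back to back.

```python
from typing import Iterator

def stt_stream(audio_chunks: Iterator[bytes]) -> Iterator[str]:
    """Hypothetical STT: yields partial transcript words as audio arrives."""
    for i, _chunk in enumerate(audio_chunks):
        yield f"word{i}"

def llm_stream(words: Iterator[str]) -> Iterator[str]:
    """Hypothetical LLM: yields response tokens while transcription continues."""
    for w in words:
        yield w.upper()

def tts_stream(tokens: Iterator[str]) -> Iterator[bytes]:
    """Hypothetical TTS: yields audio chunks as tokens arrive."""
    for t in tokens:
        yield t.encode()

# Each stage pulls from the previous one lazily, so all three overlap:
audio = iter([b"a", b"b", b"c"])
for chunk in tts_stream(llm_stream(stt_stream(audio))):
    print(chunk)
```

In production the stages run as concurrent tasks connected by queues, but the pull-based shape is the same: no stage waits for an upstream stage to finish.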
Compare this to a traditional phone IVR system, which takes 2 to 4 seconds between your input and a response[2]. That gap is why people hate calling their bank. Modern voice agents cut it by an order of magnitude.
Speech-to-Text
The first job is converting raw audio into text, and it has to happen while the user is still speaking.
Streaming transcription sends audio chunks over a WebSocket connection as they arrive from the microphone. The STT model returns partial transcripts in real time, refining them as more audio comes in. By the time the user finishes a sentence, most of it is already transcribed.
Endpointing is the problem of deciding when the user has stopped talking. Too aggressive and you cut them off mid-sentence. Too conservative and you add dead air. Good endpointing models use a combination of silence duration (typically 300 to 500ms), prosodic cues like falling intonation, and syntactic completeness.
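The cue-combination idea can be sketched as a heuristic that shortens the silence threshold when prosodic and syntactic signals agree the utterance is complete. The specific adjustments and the 300ms floor are illustrative assumptions, not a production algorithm.

```python
def user_finished(silence_ms: int,
                  falling_intonation: bool,
                  syntactically_complete: bool,
                  base_threshold_ms: int = 500) -> bool:
    """Endpointing sketch: each completion cue shaves time off the
    silence threshold, so confident endpoints fire faster."""
    threshold = base_threshold_ms
    if falling_intonation:
        threshold -= 100
    if syntactically_complete:
        threshold -= 100
    return silence_ms >= max(threshold, 300)  # never endpoint below 300ms
```

With both cues present, 350ms of silence is enough to endpoint; with neither, the agent waits the full 500ms.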
Deepgram leads on latency at roughly 100ms for streaming transcription[3]. OpenAI Whisper is more accurate for long-form audio but adds latency that makes it less suited for real-time conversation. For voice agents, speed wins over marginal accuracy gains.
Voice Activity Detection (VAD) filters out background noise, keyboard clicks, and ambient sound so the STT model only processes actual speech. Without it, the pipeline triggers on every car horn and dog bark.
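At its simplest, VAD is an energy gate. A toy sketch: flag a frame as speech when its RMS energy clears a threshold. Real VADs (WebRTC's, for instance) also use spectral features and temporal smoothing, and the threshold here is an arbitrary assumption.

```python
import math

def is_speech(frame: list[int], energy_threshold: float = 500.0) -> bool:
    """Toy energy-based VAD over one frame of 16-bit PCM samples."""
    rms = math.sqrt(sum(s * s for s in frame) / len(frame))
    return rms > energy_threshold
```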
LLM Processing
The LLM stage uses the same foundation models as text chat. Claude, GPT-4, and others all work. But voice puts unique constraints on how you use them.
Shorter prompts matter. Every token in the system prompt adds latency to the first response token. Voice agent prompts are typically 30 to 50% shorter than their chat equivalents. You strip out formatting instructions (no markdown in speech), reduce few-shot examples, and keep the persona definition tight.
Token streaming is non-negotiable. The TTS engine starts generating audio as soon as the first tokens arrive. A voice agent that waits for the complete LLM response before speaking would add seconds of silence.
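The handoff to TTS usually works on clause boundaries rather than individual tokens, since TTS needs enough text to get the prosody right. A minimal sketch, assuming tokens arrive as strings: buffer streamed tokens and flush a chunk to TTS at punctuation.

```python
from typing import Iterator

def chunk_for_tts(tokens: Iterator[str]) -> Iterator[str]:
    """Buffer streamed LLM tokens; flush to TTS at clause boundaries
    so audio starts long before the full response exists."""
    buffer: list[str] = []
    for tok in tokens:
        buffer.append(tok)
        if tok.rstrip().endswith((".", "!", "?", ",")):
            yield "".join(buffer)
            buffer = []
    if buffer:  # flush whatever remains when the stream ends
        yield "".join(buffer)
```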
Fill phrases solve the tool call problem. When the LLM needs to call a function (check inventory, look up an order), there is a pause while the tool executes. Good voice agents generate filler like "Let me check that for you" or "One moment" before the tool call begins. This keeps the conversation flowing instead of dropping into awkward silence.
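The pattern is simple to sketch. Both `tool_fn` and `speak` below are hypothetical callables standing in for a real tool executor and TTS output:

```python
import random

FILLERS = ["Let me check that for you.", "One moment."]

def respond_with_tool(tool_fn, speak) -> None:
    """Speak a filler immediately, then run the (slow) tool call,
    so the line never drops into silence while the tool executes."""
    speak(random.choice(FILLERS))
    result = tool_fn()  # blocking tool execution
    speak(f"Here is what I found: {result}")
```

In a streaming pipeline the filler would be synthesized while the tool call runs concurrently, but the ordering is the point: audio starts before the tool does.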
The target is time to first token (TTFT) under 200ms. Anthropic and OpenAI both hit this consistently on their fastest model tiers when prompts are kept lean[4].
Text-to-Speech
TTS converts the LLM's text output into spoken audio. The quality gap between good and bad TTS is the difference between talking to a person and talking to a GPS from 2008.
Neural TTS models generate speech by predicting audio waveforms from text, producing natural intonation, pacing, and emphasis. Concatenative TTS, the older approach, stitches together pre-recorded phoneme fragments and sounds robotic by comparison. Every modern voice agent uses neural models.
ElevenLabs is the current performance leader for real-time voice synthesis. Their Turbo v2.5 model delivers sub-100ms latency for streaming TTS, supports 29 languages, and offers voice cloning from as little as 30 seconds of reference audio[5]. The cloning capability matters for brands that want a consistent, recognizable voice identity.
Streaming TTS works like streaming STT but in reverse. As tokens arrive from the LLM, the TTS engine converts them to audio chunks and sends them to the client immediately. The user hears the first word of the response while the LLM is still generating the rest.
Voice consistency across turns is harder than it sounds. Each TTS inference is independent, so the same text can sound slightly different each time. ElevenLabs handles this with voice embeddings that lock the style, pitch, and cadence to a fixed identity.
Real-Time Transport
Getting audio from the user's microphone to your servers and back requires a transport layer built for real-time media, not HTTP.
WebRTC is the standard protocol for real-time audio and video on the web[6]. It handles codec negotiation, NAT traversal, packet loss concealment, and adaptive bitrate. Browsers ship with native WebRTC support, so there is no SDK to install on the client side.
ElevenLabs Conversational AI wraps this into a managed service. You configure a voice agent with a system prompt and tools, then connect clients via a WebSocket that handles the full STT, LLM, and TTS pipeline. The client sends audio and receives audio. Everything in between is managed.
Signed URL authentication prevents unauthorized use of your voice agent. Instead of embedding API keys in client code, the server generates a short-lived signed URL (typically valid for 10 to 15 minutes) that the client uses to establish the WebSocket connection. When it expires, the client requests a fresh one.
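The generic shape of this is an HMAC over an expiry timestamp. A sketch of the pattern, not ElevenLabs' actual API; the hostname, query parameters, and secret are all placeholder assumptions:

```python
import hashlib
import hmac
import time
from urllib.parse import urlencode

SECRET = b"server-side-secret"  # stays on the server, never in client code

def signed_url(agent_id: str, ttl_s: int = 600) -> str:
    """Server side: sign agent id + expiry; client uses the URL to
    open the WebSocket until it expires."""
    expires = int(time.time()) + ttl_s
    payload = f"{agent_id}:{expires}".encode()
    sig = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    query = urlencode({"agent": agent_id, "exp": expires, "sig": sig})
    return f"wss://example.com/agent?{query}"

def verify(agent_id: str, expires: int, sig: str) -> bool:
    """Server side, on connection: reject expired or forged URLs."""
    if time.time() > expires:
        return False  # expired: client must request a fresh URL
    payload = f"{agent_id}:{expires}".encode()
    expected = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, sig)
```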
Interruption Handling
Humans interrupt each other constantly. Voice agents need to handle it gracefully or they feel like talking to a wall.
Barge-in is when the user starts speaking while the agent is still talking. There are three strategies:
- Hard cut: Immediately stop the agent's audio and begin processing the user's speech. Fast and responsive, but can feel abrupt if the user was just saying "uh-huh" or agreeing.
- Soft fade: Quickly fade the agent's volume while transitioning to listening mode. Feels more natural but adds 100 to 200ms of transition time.
- Queue: Let the agent finish its current sentence, then process the interruption. Works for short agent responses but feels unresponsive on long ones.
Most production voice agents use hard cut as the default with heuristics to ignore non-speech sounds and brief acknowledgments.
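Those heuristics can be sketched as a gate in front of the hard cut. The acknowledgment list and the 250ms minimum are illustrative assumptions:

```python
ACKNOWLEDGMENTS = {"uh-huh", "mm-hmm", "yeah", "okay", "right"}

def should_interrupt(partial_transcript: str, speech_ms: int) -> bool:
    """Hard-cut gate: ignore very brief sounds and bare back-channel
    acknowledgments so they don't stop the agent mid-sentence."""
    text = partial_transcript.strip().lower()
    if speech_ms < 250:          # too short to be a real turn
        return False
    if text in ACKNOWLEDGMENTS:  # back-channel, not an interruption
        return False
    return True
```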
Silence detection is the flip side. When the user goes quiet, the agent needs to decide: are they thinking, or are they done? Too eager and you talk over someone gathering their thoughts. Too patient and the conversation stalls. The standard approach is a tiered timeout: short silence (500ms) after a question from the agent, longer silence (1.5 to 2 seconds) after an open-ended prompt[7].
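A sketch of the tiered timeout. Classifying the agent's last utterance by phrasing and punctuation is a crude stand-in for a real intent model, and the prompt prefixes are hypothetical:

```python
def silence_timeout_ms(last_agent_utterance: str) -> int:
    """Pick the silence timeout based on what the agent just said:
    short after a direct question, long after an open-ended prompt."""
    text = last_agent_utterance.strip().lower()
    open_ended = text.startswith(("tell me", "describe", "walk me through"))
    if open_ended:
        return 1800  # give the user room to think
    if text.endswith("?"):
        return 500   # direct question: expect a quick answer
    return 1000      # default middle ground
```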
The best voice agents feel like a real conversation because every one of these components is tuned to match human speech timing. The pipeline is only as fast as its slowest stage, so architectural decisions at every layer compound into the experience the user feels.
How Site Scanner Helps
Site Scanner measures the performance and content structure signals that voice agents depend on. Fast page loads mean faster tool responses during conversation. Clean HTML and structured data mean the agent can retrieve accurate product information, pricing, and policies in real time.