Voice agents must feel natural. Human conversation tolerates a 300-500ms response window; anything above 500ms feels unnatural and erodes trust. Point11's voice agents target sub-200ms response times, at or below the roughly 200-millisecond gap that linguistics research identifies as the universal conversational breakpoint.
The Latency Budget
A voice AI pipeline follows the sequence User Speech → STT → LLM → TTS → Audio Playback. Each stage contributes latency:
- STT / ASR: 100-500ms typical, 50-100ms with streaming.
- LLM inference: 350-1000ms+ typical, 150-300ms with optimized models.
- TTS synthesis: 75-200ms typical, 75ms with ElevenLabs Flash v2.5.
- Network round-trip: 50-300ms typical, 5-20ms with edge deployment.
Total unoptimized: approximately 1000ms. Optimized: 300-500ms. With edge deployment and speech-to-speech models: under 200ms.
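To make the budget concrete, here is a back-of-envelope sketch that sums per-stage latencies drawn from the ranges above; the specific values are illustrative midpoints and targets, not measurements.

```python
# Back-of-envelope latency budget using illustrative values from the ranges above.

STAGES_TYPICAL_MS = {      # unoptimized pipeline
    "stt": 250,
    "llm": 500,
    "tts": 150,
    "network": 100,
}

STAGES_OPTIMIZED_MS = {    # streaming STT, optimized LLM, Flash TTS, edge network
    "stt": 75,
    "llm": 200,
    "tts": 75,
    "network": 15,
}

def total_latency(stages: dict) -> int:
    """Sum per-stage latencies into an end-to-end response time (ms)."""
    return sum(stages.values())

print(f"Unoptimized: ~{total_latency(STAGES_TYPICAL_MS)} ms")    # ~1000 ms
print(f"Optimized:   ~{total_latency(STAGES_OPTIMIZED_MS)} ms")  # ~365 ms
```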
Achieving Sub-200ms Latency
Streaming Pipeline
Process audio as it arrives rather than waiting for the user to finish speaking:
- Streaming ASR: transcribes in real time, cutting delay to 100-200ms.
- Speculative response generation: starts generating a response before the user finishes, using partial transcription.
- Chunk-based TTS: begins playing the first audio chunk while subsequent chunks are still being synthesized.
Together, these techniques reduce perceived latency by 40-60% versus batch processing.
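The asyncio sketch below illustrates that overlap: ASR partials accumulate, the LLM streams its reply sentence by sentence, and each sentence is synthesized and played while the next is still being generated. Every helper is a hypothetical stand-in for a real STT/LLM/TTS stack, and the sleep calls simulate processing time.

```python
import asyncio

async def stt_stream():
    """Yield partial transcripts as the user's audio arrives (simulated)."""
    for partial in ["what's my", "what's my account", "what's my account balance"]:
        await asyncio.sleep(0.10)
        yield partial

async def llm_sentences(prompt: str, out: asyncio.Queue):
    """Stream the reply sentence by sentence into a queue (simulated LLM)."""
    for sentence in ["Sure, one moment.", "Your balance is $42.10."]:
        await asyncio.sleep(0.15)          # simulated generation time per sentence
        await out.put(sentence)
    await out.put(None)                    # end-of-reply marker

async def tts_player(inp: asyncio.Queue):
    """Synthesize and play each chunk while later chunks are still generating."""
    while (sentence := await inp.get()) is not None:
        await asyncio.sleep(0.075)         # ~75 ms synthesis per chunk
        print(f"[audio out] {sentence}")

async def main():
    transcript = ""
    async for partial in stt_stream():
        transcript = partial               # speculative generation could start on a stable prefix
    queue: asyncio.Queue = asyncio.Queue()
    # LLM generation and chunked TTS playback run concurrently.
    await asyncio.gather(llm_sentences(transcript, queue), tts_player(queue))

asyncio.run(main())
```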
Edge Deployment
Moving processing closer to users eliminates 50-200ms of network latency per round trip:
- Co-locate GPUs and telephony infrastructure at global Points of Presence.
- Edge caching of common responses reduces total latency by 30-50%.
- On-device wake word detection avoids unnecessary round trips.
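A response cache at the edge can be as simple as a normalized-transcript lookup that returns pre-synthesized audio and skips the LLM and TTS stages entirely; the key normalization and TTL below are illustrative assumptions, not a specific product's API.

```python
import time

# Illustrative edge cache for common utterances (e.g. "what are your hours?").
# A cache hit returns pre-synthesized audio with no LLM or TTS round trip.
class ResponseCache:
    def __init__(self, ttl_seconds: int = 3600):
        self.ttl = ttl_seconds
        self._store: dict[str, tuple[float, bytes]] = {}

    @staticmethod
    def _key(transcript: str) -> str:
        # Normalize so trivial wording/whitespace differences still hit the cache.
        return " ".join(transcript.lower().split())

    def get(self, transcript: str) -> bytes | None:
        entry = self._store.get(self._key(transcript))
        if entry and time.time() - entry[0] < self.ttl:
            return entry[1]
        return None

    def put(self, transcript: str, audio: bytes) -> None:
        self._store[self._key(transcript)] = (time.time(), audio)

cache = ResponseCache()
cache.put("What are your hours?", b"<pre-synthesized audio>")
assert cache.get("what are your  hours?") is not None
```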
Model Selection
The LLM is the biggest bottleneck, consuming 40-60% of total response time:
- Use smaller, specialized models rather than frontier models for conversational tasks.
- Quantized models (INT4/INT8) provide faster inference with minimal quality loss.
- Target 50+ tokens per second generation rate.
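The 50+ tokens-per-second target follows from chunk-based TTS: only the first speakable sentence gates perceived latency. The arithmetic below assumes a 12-token opening sentence, which is an illustrative figure.

```python
# How long until the first speakable chunk is ready at different generation rates?
FIRST_SENTENCE_TOKENS = 12   # illustrative length of an opening sentence

for tokens_per_second in (20, 50, 100):
    first_chunk_ms = FIRST_SENTENCE_TOKENS / tokens_per_second * 1000
    print(f"{tokens_per_second:>3} tok/s -> first chunk ready in {first_chunk_ms:.0f} ms")

# 20 tok/s -> 600 ms (blows the budget); 50 tok/s -> 240 ms; 100 tok/s -> 120 ms
```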
For TTS, ElevenLabs Flash v2.5 is purpose-built for real-time conversational AI at approximately 75ms inference latency across 32 languages. Use premade or synthetic voices (not instant voice clones) for lowest latency.
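As a minimal sketch, the ElevenLabs text-to-speech streaming endpoint can be called with the Flash v2.5 model ID and consumed chunk by chunk; the voice ID below is a placeholder and playback handling is omitted.

```python
import os
import requests

VOICE_ID = "your-premade-voice-id"  # placeholder for a premade voice
url = f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}/stream"

response = requests.post(
    url,
    headers={"xi-api-key": os.environ["ELEVENLABS_API_KEY"]},
    json={
        "text": "Thanks for calling. How can I help you today?",
        "model_id": "eleven_flash_v2_5",   # low-latency Flash v2.5 model
    },
    stream=True,
)
response.raise_for_status()

for chunk in response.iter_content(chunk_size=4096):
    if chunk:
        # Feed each audio chunk to the output device as it arrives
        # instead of waiting for the full response (chunk-based playback).
        pass
```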
Natural Turn-Taking
ElevenLabs Conversational AI 2.0 includes a state-of-the-art turn-taking model that analyzes conversational cues in real time. It detects pauses, filler words ("um," "ah"), and intonation patterns to determine when to speak and when to wait.
This prevents the agent from:
- Interrupting the user mid-sentence.
- Waiting too long after the user finishes (creating awkward silence).
- Responding to background noise or non-speech audio.
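The ElevenLabs turn-taking model itself is proprietary, but a simplified heuristic in the same spirit, combining silence duration with trailing-filler detection, is sketched below; the thresholds are arbitrary illustrative values.

```python
# Simplified turn-end heuristic, NOT the ElevenLabs turn-taking model:
# take the turn only after enough silence, and wait longer if the last
# word was a filler ("um", "ah") that usually signals more speech is coming.
FILLERS = {"um", "uh", "ah", "hmm"}
END_SILENCE_MS = 400          # silence needed to claim the turn
FILLER_EXTRA_MS = 700         # wait longer after a trailing filler

def should_respond(last_words: list[str], silence_ms: float) -> bool:
    """Decide whether the agent should take the turn."""
    if not last_words:
        return False                       # nothing but background noise so far
    threshold = END_SILENCE_MS
    if last_words[-1].lower().strip(".,") in FILLERS:
        threshold = FILLER_EXTRA_MS        # "um..." usually means more is coming
    return silence_ms >= threshold

print(should_respond(["balance", "um"], 500))   # False: filler, keep waiting
print(should_respond(["balance"], 500))         # True: clean pause, take the turn
```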
Telephony Integration
For inbound and outbound calling:
- SIP: the standard protocol for VoIP/telephony integration.
- WebRTC: preferred for browser and mobile clients, with built-in low-latency audio codecs.
- Batch calling: automates outbound voice campaigns by initiating many calls simultaneously.
Configure disclosure flows that inform callers they are speaking with an AI agent, as required by TCPA regulations.
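As one hypothetical example using the Twilio Python SDK (Twilio is linked in the sources below), an outbound call can open with a spoken AI disclosure before handing off to the agent; the phone numbers, credentials, and handoff step are placeholders.

```python
import os
from twilio.rest import Client

client = Client(os.environ["TWILIO_ACCOUNT_SID"], os.environ["TWILIO_AUTH_TOKEN"])

# TwiML that leads with an AI disclosure; the agent handoff is left as a comment.
disclosure_twiml = (
    "<Response>"
    "<Say>Hi, this is an automated AI assistant calling on behalf of Point11.</Say>"
    "<Pause length='1'/>"
    # Hand off to the voice agent here, e.g. by connecting a media stream.
    "</Response>"
)

call = client.calls.create(
    to="+15551230000",          # placeholder destination number
    from_="+15559870000",       # placeholder caller ID
    twiml=disclosure_twiml,
)
print(call.sid)
```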
Sources
- ElevenLabs Latency Optimization: https://elevenlabs.io/docs/developers/best-practices/latency-optimization
- ElevenLabs Models Reference: https://elevenlabs.io/docs/overview/models
- AssemblyAI — The 300ms Rule: https://www.assemblyai.com/blog/low-latency-voice-ai
- Twilio — Core Latency Guide: https://www.twilio.com/en-us/blog/developers/best-practices/guide-core-latency-ai-voice-agents
- Cresta — Engineering for Real-Time Voice Agent Latency: https://cresta.com/blog/engineering-for-real-time-voice-agent-latency