Voice AI in 2026 is in the rare phase where every piece of the stack is genuinely good. Speech-to-text models hit human-level accuracy on common speech. LLMs are conversationally fluent. Text-to-speech models like Cartesia's Sonic-3 stream first audio in under 100ms with prosody that lands. Real-time audio infrastructure is a commodity. The combination unlocks applications — voice agents for customer support, voice-first AI companions, embedded conversational interfaces — that just did not work even 18 months ago.
What it does not unlock is the ability to skip past the engineering. A voice agent that feels good is the product of many small decisions about latency budgets, barge-in, turn detection, and provider selection that none of the demo apps surface. This guide is the production-grade version: what to pick, why, and where the sharp edges are.
The Two Stack Choices
The first architectural decision is the most consequential: cascaded vs end-to-end.
The cascaded stack (STT → LLM → TTS)
You wire three providers (or models) together. The user's audio goes to a streaming speech-to-text service like Deepgram, AssemblyAI, or self-hosted Whisper. The transcribed text streams into an LLM (Anthropic Claude, OpenAI, Gemini, an open model). The LLM's streamed output goes to a TTS service like Cartesia Sonic, ElevenLabs, or OpenAI TTS. Audio streams back to the user.
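To make the data flow concrete, here is a minimal sketch of the three stages chained as streaming async generators. Every function here (`stt_stream`, `llm_stream`, `tts_stream`, `mic`) is a hypothetical stub standing in for a provider client, not any vendor's actual SDK; the point is only that each stage consumes a stream and produces a stream.

```python
import asyncio
from typing import AsyncIterator


async def stt_stream(audio_chunks: AsyncIterator[bytes]) -> AsyncIterator[str]:
    """Stub STT: pretends each audio chunk decodes to one word."""
    async for chunk in audio_chunks:
        yield chunk.decode("utf-8")


async def llm_stream(transcript: str) -> AsyncIterator[str]:
    """Stub LLM: streams a canned reply token by token."""
    for token in f"You said: {transcript}".split(" "):
        await asyncio.sleep(0)  # yield control, as a network read would
        yield token + " "


async def tts_stream(text_chunks: AsyncIterator[str]) -> AsyncIterator[bytes]:
    """Stub TTS: 'synthesizes' each text chunk into fake audio bytes."""
    async for text in text_chunks:
        yield text.encode("utf-8")


async def mic(words: list[str]) -> AsyncIterator[bytes]:
    """Stub microphone producing one chunk per word."""
    for w in words:
        yield w.encode("utf-8")


async def run_turn(words: list[str]) -> bytes:
    # 1. Audio -> transcript (collected here until the end of the user's turn).
    transcript = " ".join([w async for w in stt_stream(mic(words))])
    # 2 + 3. LLM tokens flow straight into TTS without buffering the full reply.
    audio = b""
    async for chunk in tts_stream(llm_stream(transcript)):
        audio += chunk
    return audio


result = asyncio.run(run_turn(["book", "a", "table"]))
```

In a real pipeline each stub would wrap a network stream, and the text passing between stages is what gives the cascaded design its observability and swappability.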
When this is right: You need control over the LLM (specific model, custom system prompt, tool calling), you support languages outside the end-to-end model's coverage, you want to mix vendors based on cost or capability, or you're building a voice interface on top of an existing AI product.
The end-to-end stack (speech-in / speech-out single model)
The user's audio stream goes directly to a single multimodal model — OpenAI Realtime, Google Live API, or in-the-works competitors — and audio comes back. Speech is treated as a first-class modality alongside text, not piped through transcription.
When this is right: Latency matters a lot, you want emotional nuance and prosody handling, the language coverage matches your users, and you don't need surgical control over the underlying LLM. End-to-end is the right default for consumer voice apps in 2026 where the conversational feel is the product.
| Dimension | Cascaded (STT→LLM→TTS) | End-to-End (e.g. OpenAI Realtime) |
|---|---|---|
| Median latency | 600–1000ms | 300–600ms |
| LLM swappability | Full | None |
| Voice variety | High (provider library) | Limited preset voices |
| Prosody / emotional nuance | Decent | Best-in-class |
| Multilingual coverage | Per-provider, broad | English-first; growing |
| Per-minute cost (est.) | $0.03–$0.10 | $0.10–$0.30 |
| Observability | High (text in middle) | Lower (no text intermediate; transcripts are a side output) |
| Best for | Enterprise voice agents, multilingual, complex tool use | Consumer voice apps, conversational AI, low-latency feel |
A practical heuristic: if you can write down the LLM you want to use and would feel constrained without it, you want cascaded. If you can't easily articulate why you'd care about the LLM choice, you probably want end-to-end.
The Latency Budget (and Why It's Tighter Than You Think)
Voice AI lives or dies by latency. Humans expect a response 300–800ms after they finish speaking. Below 300ms feels unnaturally fast and creates a "cuts you off" impression. Above 1.2 seconds feels noticeably slow and breaks conversational rhythm. The brain is unforgiving here.
The whole budget has to be split across your pipeline. A workable allocation for a cascaded stack targeting ~700ms end-to-end:
- Voice activity detection / turn detection: 100–200ms after user stops speaking (this is the "how do I know they're done?" wait)
- Streaming STT final transcript: 50–150ms (with provider streaming)
- LLM time-to-first-token: 150–400ms (depends heavily on model and provider region)
- TTS first-audio latency: 50–150ms (Cartesia Sonic streams at the lower end of this)
- Network and audio playback startup: ~50ms
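As a quick sanity check on the allocation above, the ranges can be summed directly. This is plain arithmetic on the figures in the list; the dictionary keys are just illustrative labels.

```python
# Per-stage latency ranges in milliseconds, copied from the list above.
BUDGET_MS = {
    "turn_detection": (100, 200),
    "stt_final": (50, 150),
    "llm_ttft": (150, 400),
    "tts_first_audio": (50, 150),
    "network_playback": (50, 50),
}

best_case = sum(low for low, _ in BUDGET_MS.values())
worst_case = sum(high for _, high in BUDGET_MS.values())
midpoint = (best_case + worst_case) / 2
# Stage with the largest worst-case contribution.
widest_stage = max(BUDGET_MS, key=lambda k: BUDGET_MS[k][1])

print(f"best {best_case}ms, worst {worst_case}ms, midpoint {midpoint:.0f}ms")
print(f"largest worst-case stage: {widest_stage}")
```

The ranges sum to 400–950ms with a midpoint of 675ms, which is consistent with the ~700ms target. If every stage hits its worst case at once, the total sits at 950ms, leaving little headroom before the 1.2 second mark.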
The single biggest budget consumer is usually LLM time-to-first-token, so that is where the leverage is. Pick a model with low TTFT, place it geographically near your users, and design prompts to stream meaningful content from token one. The trick that converts most cascaded pipelines from "slow" to "snappy" is sending each sentence to TTS as soon as the LLM has completed it, rather than waiting for the full response.
Speech-to-Text in 2026: What Actually Works
STT is the most commoditized component of the voice stack. The model quality from leading providers is close enough that the differentiator is latency, streaming behavior, and language coverage.
| Provider / model | Strengths | Trade-offs |
|---|---|---|
| Deepgram Nova-3 | Lowest streaming latency, strong English accuracy, great real-time API | Multilingual coverage narrower than alternatives |
| Deepgram Aura (TTS, listed here for pairing with Deepgram STT) | Same provider as STT, integrated latency | Voice library smaller than Cartesia/ElevenLabs |
| AssemblyAI Universal-2 | Great non-English accuracy, batch + streaming, strong PII redaction | Streaming latency slightly higher than Deepgram |
| OpenAI Whisper (self-hosted) | Open weights, full control, multilingual | You operate the inference; streaming requires custom work |
| Google Cloud Speech v2 | Strong multilingual, well-integrated into GCP stack | Streaming API ergonomics weaker than Deepgram |
| Microsoft Azure Speech | Enterprise compliance, strong real-time | Heavier integration burden, costs scale fast |
For most production voice apps targeting English-first usage, Deepgram is the default in 2026 for the same reasons it was in 2024: lowest-latency streaming, a well-built real-time API, and reliable accuracy. For multilingual use, AssemblyAI is often the better choice. Whisper is excellent for offline or privacy-sensitive use cases but rarely competitive for low-latency real-time work.
Text-to-Speech in 2026: The Cartesia & ElevenLabs Era
This is the part of the stack that changed the most in the past 18 months. Until late 2024, TTS was the long pole in voice AI latency: first audio could take 300–500ms even from streaming providers. Cartesia's Sonic models changed the math by getting first audio under 100ms with a state-space-model architecture that streams natively.
| Provider | Best at | Sweet spot |
|---|---|---|
| Cartesia Sonic-3 | Lowest first-audio latency (~80ms), great prosody for the speed | Voice agents where latency is the primary product feel |
| ElevenLabs v3 | Best voice cloning, expressive emotion, broadest voice library | Apps where voice character or cloning is the differentiator |
| OpenAI TTS (gpt-4o-mini-tts) | Convenience if you're already on OpenAI, decent prosody | Cascaded apps where OpenAI is your LLM |
| Deepgram Aura | Same-provider latency benefits when paired with Deepgram STT | Telephony and call-center applications |
| Inworld (Suno-adjacent) | Character voices, game-style emotion control | Game characters and entertainment apps |
| Self-hosted (XTTS-v3, etc.) | No per-character cost, full control, privacy | Privacy-critical or extreme-scale apps with infra capacity |
The pattern most production voice agents converge on in 2026 is: Cartesia Sonic for general latency-sensitive voice agents, ElevenLabs for premium consumer apps where voice character matters, and self-hosted XTTS-v3 only when there's a hard privacy or cost reason. For more on Cartesia specifically, see our Cartesia company profile.
The Hard Problems: Turn Detection and Barge-In
If you build the cascaded stack above, picking providers is the easy part. The hard part — the part that determines whether your agent feels good or feels broken — is turn handling.
Turn detection (knowing when the user finished talking)
Simple voice activity detection ("there's audio / there's no audio") is not enough. A user pausing mid-thought is not a user finishing a sentence. The agent that responds to every pause feels twitchy and rude.
Modern turn detection uses semantic VAD — small classifier models trained on conversational data that consider both audio signal and the in-progress transcript to decide whether the user is actually finished. LiveKit's turn detection model and the open-source semantic-vad-1.5b are both reasonable defaults. Without semantic turn detection, your latency budget gets eaten by waiting longer to be safe.
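One way to picture the decision is as a function of two signals: how long the user has been silent, and a score from a semantic classifier estimating whether the utterance is complete. The sketch below assumes such a score is available as `eot_probability`; that field, the `TurnState` type, and all thresholds are hypothetical and untuned, and this is not the API of LiveKit or any other framework.

```python
from dataclasses import dataclass


@dataclass
class TurnState:
    silence_ms: int          # time since the last detected speech frame
    eot_probability: float   # hypothetical classifier score: is the utterance complete?


def should_respond(
    state: TurnState,
    min_silence_ms: int = 150,
    max_silence_ms: int = 1200,
    eot_threshold: float = 0.7,
) -> bool:
    """Decide whether the agent should take the turn.

    All thresholds are illustrative assumptions, not tuned values.
    """
    if state.silence_ms < min_silence_ms:
        return False  # user is still speaking or has barely paused
    if state.silence_ms >= max_silence_ms:
        return True   # long silence: respond even if the sentence looks unfinished
    # In between, defer to the semantic signal.
    return state.eot_probability >= eot_threshold
```

With audio-only VAD, the only safe option is to raise the silence threshold, which is exactly the latency cost described above. The semantic score lets the agent respond after a short pause when the sentence is clearly finished.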
Barge-in (the user interrupts the agent)
The user starts talking while the agent is mid-sentence. The agent has to: detect the user's speech, abort the in-flight TTS, cancel the audio playback, ignore the partial audio leaking back through the microphone, and process the user's new turn cleanly. Most voice agent failures in the wild are barge-in failures.
The pattern that works:
- Run VAD continuously on the inbound mic stream, even while playing TTS audio.
- When user speech is detected with confidence above a threshold, immediately stop the audio playback and abort the in-flight TTS request.
- Filter the inbound mic audio for echo / acoustic feedback (the agent's own TTS audio leaking back into the microphone).
- Don't count the few hundred ms of overlap audio as a turn — wait for clean user speech before starting transcription in earnest.
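The core of the pattern above, running VAD concurrently with playback and cancelling playback on detection, can be sketched with plain `asyncio` tasks. `play_tts` and `watch_for_barge_in` are hypothetical stubs with made-up timings; echo filtering and discarding overlap audio are not modeled here.

```python
import asyncio


async def play_tts(sentences: list[str], played: list[str]) -> None:
    """Stub playback: 'plays' one sentence per interval and records it."""
    for sentence in sentences:
        await asyncio.sleep(0.1)  # stands in for the sentence's audio duration
        played.append(sentence)


async def watch_for_barge_in(vad_scores: list[float], threshold: float) -> int:
    """Stub VAD loop: returns the index of the first frame at or above threshold, else -1."""
    for i, score in enumerate(vad_scores):
        await asyncio.sleep(0.01)  # stands in for one mic frame
        if score >= threshold:
            return i
    return -1


async def agent_turn(sentences: list[str], vad_scores: list[float], threshold: float = 0.8):
    played: list[str] = []
    playback = asyncio.create_task(play_tts(sentences, played))
    vad = asyncio.create_task(watch_for_barge_in(vad_scores, threshold))

    # VAD runs while audio plays; whichever finishes first decides what happens next.
    done, _ = await asyncio.wait({playback, vad}, return_when=asyncio.FIRST_COMPLETED)
    interrupted = vad in done and vad.result() >= 0
    if interrupted:
        playback.cancel()  # stop audio and abandon the in-flight TTS
    else:
        await playback     # no barge-in: let the agent finish speaking
        vad.cancel()
    await asyncio.gather(playback, vad, return_exceptions=True)
    return played, interrupted


played, interrupted = asyncio.run(
    agent_turn(["Sure, I can help.", "What day works?"], [0.1, 0.1, 0.9])
)
```

Here the third mic frame crosses the threshold before the first sentence finishes, so playback is cancelled with nothing fully played. A production implementation also has to cancel the upstream TTS request and flush any audio already buffered on the client.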
Frameworks like LiveKit Agents and Pipecat have made this dramatically easier than building it yourself. Unless you have a very specific reason, use one of them. Doing barge-in correctly without a framework takes enough audio plumbing to become a six-month project.
Frameworks Worth Using
- LiveKit Agents — Python and Node. Tight integration with LiveKit's WebRTC infrastructure. Good defaults for VAD, turn detection, barge-in. The pragmatic default for most teams in 2026.
- Pipecat — Python, modular pipeline architecture. Easier to compose unusual provider combinations or custom processing nodes. Slightly more setup than LiveKit Agents but more flexibility.
- Vapi / Retell AI — Higher-level platforms. You configure rather than code. Good for fast experimentation but harder to extend when your product gets specific.
- OpenAI Realtime SDK — If you've decided on end-to-end with OpenAI, their SDK is the obvious starting point. WebRTC under the hood. Mature.
Evaluation: How to Know Your Voice Agent Is Good
Evaluating voice agents is materially harder than evaluating text-based LLM apps. The conversation has temporal structure. Quality depends on prosody, pacing, and barge-in handling — not just transcript accuracy.
The eval framework that works for most production voice apps in 2026:
- Transcript-level evaluation. Convert the conversation to text and apply standard LLM eval techniques — correctness, tool use accuracy, instruction-following. See our broader LLM evaluation guide for the methodology.
- Latency tracking. Time-to-first-audio for each agent turn, end-to-end loop time, distribution of pauses and gaps. Latency regressions are the most common voice quality regression and easy to miss.
- Conversation-level evaluation. Human raters or LLM-as-judge scoring the conversation as a whole: Did the agent feel natural? Did it interrupt awkwardly? Did it handle barge-in correctly?
- Production logging with audio. Sample real production conversations (with permission) and review them weekly. Voice failures show up qualitatively in ways that aggregate metrics miss.
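For the latency-tracking item above, a small sketch of summarizing time-to-first-audio samples and flagging a regression. It uses only the standard library; the function names and the 10% tolerance are assumptions for illustration.

```python
import statistics


def latency_summary(ttfa_ms: list[float]) -> dict[str, float]:
    """Summarize time-to-first-audio samples (one per agent turn)."""
    # quantiles(n=100) returns 99 cut points; index 49 is p50, index 94 is p95.
    cuts = statistics.quantiles(ttfa_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "max": max(ttfa_ms)}


def regressed(
    baseline: dict[str, float],
    current: dict[str, float],
    tolerance: float = 0.10,
) -> bool:
    """Flag a regression if p95 grew by more than `tolerance` (an assumed 10%)."""
    return current["p95"] > baseline["p95"] * (1 + tolerance)
```

Tracking p95 rather than the mean matters because a handful of slow turns is what users remember, and a mean can hide them.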
The Production Patterns That Save You
Three patterns we see across well-run voice AI products in 2026:
Start the TTS request before the LLM is done
The latency trick that compounds best is interleaving. Once the LLM has produced a complete sentence, send it to TTS while the LLM continues generating. By the time the LLM finishes the response, the first audio has already started playing. This collapses 200–400ms of perceived latency for free.
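A minimal sketch of the interleaving step: buffer LLM tokens and emit each sentence once it is complete, so it can be handed to TTS while generation continues. The regex boundary rule is a deliberate simplification and `sentences_from_tokens` is an illustrative name, not part of any framework.

```python
import re
from typing import Iterable, Iterator

# A sentence ends at ., !, or ? followed by whitespace. This is a simplification:
# it will also split after abbreviations such as "Dr. Smith".
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def sentences_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Yield each sentence once the stream moves past its terminal punctuation."""
    buffer = ""
    for token in tokens:
        buffer += token
        parts = _SENTENCE_END.split(buffer)
        # Everything except the last part is a finished sentence.
        for sentence in parts[:-1]:
            yield sentence
        buffer = parts[-1]
    if buffer.strip():
        yield buffer.strip()  # flush whatever remains when the stream ends


tokens = ["Sure", ".", " I", " can", " book", " that", ".", " What", " time", "?"]
chunks = list(sentences_from_tokens(tokens))
```

Each yielded sentence would be sent to the TTS provider immediately. Sentence-sized chunks are a reasonable unit because they are short enough to start early but long enough for the TTS model to produce natural prosody.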
Treat the system prompt as a state machine, not a paragraph
Voice agents that work consistently treat the system prompt as a structured set of rules with explicit branches: "if user asks for X, do Y," "if user has not provided Z yet, ask for it," "if the user interrupts, summarize what you understood before continuing." Conversational LLMs handle this much better than they handle vague prose like "be helpful and friendly." This is especially true for agents doing structured tasks — appointment booking, customer support triage, lead qualification.
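One way to keep the prompt structured is to store the branches as data and render them into explicit rules. The sketch below is an assumption about how that might look; the `Rule` type, the example rules, and the IF/THEN wording are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    when: str   # condition, phrased from the agent's point of view
    then: str   # required action


# Hypothetical rules for an appointment-booking agent.
BOOKING_RULES = [
    Rule("the user asks to book an appointment", "collect date, time, and name before confirming"),
    Rule("the user has not provided a date yet", "ask for the date and nothing else"),
    Rule("the user interrupts you", "summarize what you understood before continuing"),
]


def render_system_prompt(role: str, rules: list[Rule]) -> str:
    """Render rules as numbered IF/THEN lines so each branch is explicit."""
    lines = [f"You are {role}. Follow these rules in order:"]
    for i, rule in enumerate(rules, start=1):
        lines.append(f"{i}. IF {rule.when} THEN {rule.then}.")
    return "\n".join(lines)


prompt = render_system_prompt("a clinic's phone booking agent", BOOKING_RULES)
```

Keeping rules as data also makes them testable: each rule can be paired with transcript-level eval cases that check the branch actually fires.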
Log everything (legally)
Voice apps go off the rails in surprising ways — an unusual accent, a noisy environment, a partial sentence that the LLM misinterpreted. You can't debug what you can't replay. Log the audio (with consent and proper retention), the transcripts, the LLM tool calls, the latency metrics, and the user feedback signals. Without this, voice agent debugging is essentially blind.
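A sketch of what a per-turn log record might contain, serialized as one JSON line so conversations can be replayed later. The `TurnLog` schema and field names are assumptions for illustration; audio is stored elsewhere and referenced by URI only when consent exists.

```python
import json
import time
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class TurnLog:
    session_id: str
    turn_index: int
    user_transcript: str
    agent_text: str
    audio_uri: Optional[str]  # pointer to stored audio; None if there is no consent
    tool_calls: list[dict] = field(default_factory=list)
    latency_ms: dict[str, float] = field(default_factory=dict)
    timestamp: float = field(default_factory=time.time)


def to_jsonl(record: TurnLog) -> str:
    """Serialize one turn as a single JSON line for append-only logs."""
    return json.dumps(asdict(record), sort_keys=True)


record = TurnLog("s1", 0, "hi", "hello", None, latency_ms={"ttfa": 640.0})
line = to_jsonl(record)
```

Logging latency alongside the transcript in the same record makes it possible to correlate a bad turn with the stage that was slow.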
Where Voice AI Is Going Next
Three things to watch over the next 12 months:
- End-to-end models close the multilingual gap. OpenAI Realtime and Google Live API still trail cascaded stacks in non-English coverage. That gap is the single biggest reason cascaded stacks remain the default for global apps. It will close.
- Voice cloning regulation tightens. Several jurisdictions are introducing consent requirements for voice cloning. Apps that depend on user-supplied voice samples should plan for the compliance work now, not later.
- On-device voice models become viable. Sonic-3-tier quality at small model sizes is making real-time on-device TTS work for the first time. This unlocks offline / privacy-critical voice apps and dramatically changes cost economics for high-scale use cases.
The big picture: voice AI in 2026 finally feels like a real product surface, not a demo. The engineering work is non-trivial but well-trodden — you can ship a voice agent that genuinely feels natural in a quarter, not a year. The teams that win in the next phase are the ones that take the production patterns above seriously and build evaluation infrastructure as carefully as model infrastructure.