Latency at Orbit
We publish our voice-agent latency openly, broken down by stage, because customer experience starts here. Every number on this page maps 1:1 to the Prometheus histogram emitted by our voice gateway.
| Metric | Target | Current | Notes |
|---|---|---|---|
| p50 | 1100 ms | 1255 ms | Target assumes a steady-state turn, post-streaming chunker. Current measured in the synthetic test env, 2026-05-13. |
| p95 | 1500 ms | 1650 ms | Current measured in the synthetic test env, 2026-05-13. |
Per-stage budget (E1–E5)
Every turn passes through five measurable stages. The total is the sum minus parallel overlap (LLM TTFT and TTS first byte run concurrently once the streaming chunker is engaged).
| Stage | What it measures | Target | Current |
|---|---|---|---|
| Endpointing (E1) | Time from the user finishing their utterance to Deepgram's `UtteranceEnd` event. Driven primarily by the `DEVOTEL_VOICE_UTTERANCE_END_MS` setting (default 600 ms). | 600 ms | 600 ms |
| STT final (E2) | Time from Deepgram's `UtteranceEnd` to the final `Transcript.is_final = true` event with the full transcript. Includes Deepgram's nova-2 / nova-3 model post-processing. | 50 ms | 45 ms |
| LLM TTFT (E3) | Time-to-first-token from Anthropic Claude Sonnet via the agent-runtime `/chat/stream` endpoint. With a prompt-cache hit (warm session) we typically see 300–400 ms; a cold turn-1 hits 600–1500 ms but is masked by the W10-G pre-warm IIFE. | 400 ms | 380 ms |
| First chunk → TTS (E4) | Time from receiving the first streamed LLM chunk to handing it to the TTS pipeline. Pure orchestration latency; bounded by the sentence-chunker's boundary detection. | 100 ms | 90 ms |
| TTS first byte (E5) | Time from TTS handoff to the first audio byte arriving from the Cartesia / ElevenLabs failover pair. Sub-150 ms with Cartesia Sonic; the ElevenLabs Flash v2.5 fallback adds ~50 ms. | 150 ms | 140 ms |
Sum of targets: 1300 ms (parallelism collapses ~200 ms — net p50 target 1100 ms).
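To make that arithmetic concrete, here is a minimal sketch of how the net p50 target falls out of the stage budget. The values are the targets from the table, and the ~200 ms overlap constant is the figure stated above; nothing here is measured data.

```ts
// Stage targets from the table above, in milliseconds.
const targetMs = { E1: 600, E2: 50, E3: 400, E4: 100, E5: 150 };

// Serial sum: what a turn would cost if every stage ran back-to-back.
const serialSum = Object.values(targetMs).reduce((a, b) => a + b, 0); // 1300 ms

// Once the streaming chunker engages, LLM streaming (E3) and TTS first
// byte (E5) overlap; the text above puts the collapsed portion at ~200 ms.
const parallelOverlap = 200;

const netP50Target = serialSum - parallelOverlap;
console.log(netP50Target); // 1100 ms, matching the published p50 target
```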
Methodology
Every voice turn emits a Prometheus histogram `voice_agent_turn_stage_duration_ms` with labels `stage` / `tenant_org_id` / `model`. The pre-launch numbers on this page come from a 200-turn synthetic test against the same wire path production uses: Deepgram nova-3 STT, Anthropic Claude Sonnet 4.6 via the agent-runtime `/chat/stream` endpoint, Cartesia Sonic TTS with the ElevenLabs Flash v2.5 failover armed, and LiveKit Cloud media routing.
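For reference, a histogram with this name and label set could be declared as follows using the Node `prom-client` library. The bucket edges below are illustrative assumptions chosen to bracket the stage targets, not the gateway's actual configuration.

```ts
import { Histogram } from "prom-client";

// Same metric name and labels as described above.
const stageDuration = new Histogram({
  name: "voice_agent_turn_stage_duration_ms",
  help: "Per-stage voice turn latency in milliseconds",
  labelNames: ["stage", "tenant_org_id", "model"],
  // Assumed bucket boundaries for illustration only.
  buckets: [25, 50, 100, 200, 400, 600, 900, 1300, 2000, 3000],
});

// Example observation for the LLM TTFT stage (E3) of one turn.
stageDuration.labels("E3", "org_123", "claude-sonnet-4.6").observe(380);
```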
The histogram is fail-open: if the metrics module is absent (boot-check harness, local dev) the timing accumulator silently becomes a no-op, so a misconfigured probe never takes the voice path down. Verification path documented at `apps/voice-gateway/src/lib/voice-metrics.ts`.
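The fail-open pattern can be pictured with the sketch below. This is not the code in `voice-metrics.ts`; the `observeStage` export and its signature are assumptions used purely to illustrate the no-op fallback.

```ts
type StageObserver = (
  stage: string,
  tenantOrgId: string,
  model: string,
  ms: number
) => void;

// Default to a no-op so the voice path never depends on metrics being loaded.
let observeStage: StageObserver = () => {};

// Hypothetical module path; in the real repo the implementation lives at
// apps/voice-gateway/src/lib/voice-metrics.ts.
import("./voice-metrics")
  .then((m) => {
    observeStage = m.observeStage;
  })
  .catch(() => {
    // Metrics module absent (boot-check harness, local dev): stay a no-op.
    // A misconfigured probe degrades observability, never the call itself.
  });

// Callers time stages unconditionally; the observer may silently be a no-op.
observeStage("E5", "org_123", "claude-sonnet-4.6", 140);
```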
When production telemetry begins emitting, this page refreshes automatically from a scheduled probe against Google Cloud Monitoring. Until then the numbers carry the "synthetic test env" footnote above.
How we compare
Public competitor numbers, attributed inline. Where independent measurement differs from a vendor's marketed claim, we show the independent number.
- Orbit (this page) · Independent · Synthetic test environment, 2026-05-13 · Pre-launch; live histogram refresh after the first 1000 prod calls.
- Independent · tested.media, March 2026 · Marketed ~600 ms; independent p50 680 ms / p95 920 ms. Best in the AI-voice-native cohort.
- Independent · tested.media, March 2026 · Marketed <500 ms; independent p50 720 ms / p95 1050 ms. The flexibility tax shows in tail latency.
- Independent · tested.media, March 2026 · Marketed 400 ms historically (no longer published); independent p50 850 ms / p95 1180 ms.
- Marketed · livekit.io public claim · Claims sub-200 ms with their own global media-server mesh; no independent number published.
- Marketed · elevenlabs.io Turbo tier · Turbo v2 ~400 ms marketed (gpt-4o-mini + Flash v2).
- Marketed · Developer reports, 2026-Q1 · Native speech-to-speech; <500 ms reported in the early-access community. No published p95.
What moves the number
Real-world latency is sensitive to factors we don't control from a single benchmark page:
- Region. Today we serve voice from `europe-west1`. A US-East caller eats ~100 ms RTT on every hop. Multi-region (us-east-1 + ap-south-1) is on the 90-day roadmap.
- Model selection. Claude Sonnet 4.6 is the production default. Faster classifier-tier models (Haiku, gpt-5-mini) shave 100–200 ms off the LLM TTFT stage for simple intents; tunable per-agent in the dashboard.
- Prompt length. Prompt caching (Anthropic `extended-cache-ttl-2025-04-11` beta) keeps the system prompt + tool registry pinned for the session. Cold-turn pre-warm via `DEVOTEL_VOICE_PREWARM_ENABLED=true` heats the cache before turn 1.
- Tool calls. Each LLM-initiated tool call adds round-trip latency. Async tool support (W11 backlog) lets long-running tools resolve in the background without blocking the next turn.
- Speculative-on-partial. We pre-fire the LLM on a stable interim transcript so the model is already streaming when STT emits final; see the sketch after this list. Disable with `DEVOTEL_VOICE_SPECULATIVE=false` if you want a deterministic cold floor.
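As a rough illustration of speculative-on-partial, here is a minimal sketch. `startLlmStream` and its endpoint URL are hypothetical stand-ins for the agent-runtime `/chat/stream` call, and the cancel-on-change policy is one assumed way to approximate "stable interim"; this is not the production logic.

```ts
const SPECULATIVE = process.env.DEVOTEL_VOICE_SPECULATIVE !== "false";

// Hypothetical stand-in for streaming a reply from the agent runtime.
async function startLlmStream(transcript: string, signal: AbortSignal): Promise<void> {
  try {
    await fetch("https://agent-runtime.internal/chat/stream", {
      method: "POST",
      headers: { "content-type": "application/json" },
      body: JSON.stringify({ transcript }),
      signal,
    });
  } catch {
    // An aborted speculation is expected; swallow the AbortError.
  }
}

let inflight: { transcript: string; abort: AbortController } | null = null;

function fire(transcript: string): void {
  inflight?.abort.abort(); // discard any stale speculative stream
  const abort = new AbortController();
  inflight = { transcript, abort };
  void startLlmStream(transcript, abort.signal);
}

// STT interim event: pre-fire the LLM so tokens are already streaming
// by the time the final transcript lands.
function onInterim(transcript: string): void {
  if (!SPECULATIVE || inflight?.transcript === transcript) return;
  fire(transcript);
}

// STT final event: keep the in-flight stream when the speculation held,
// otherwise restart on the final text.
function onFinal(transcript: string): void {
  if (inflight?.transcript === transcript) return; // hit: reuse the stream
  fire(transcript);                                // miss: cold restart
}
```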
Test our voice agent yourself
Spin up a sandbox tenant, attach a knowledge base, place a test call — the dashboard renders the per-turn stage histogram inline so you can verify these numbers on your own traffic.