Benchmarks · 2026-05-12 · 7 min read

Why Duplex benchmarks realtime voice stacks instead of betting on one provider

The market is moving too fast to hard-code one model. Duplex should compare OpenAI Realtime, Gemini Live, PersonaPlex, Pipecat pipelines, and emerging speech-native models against the actual use case: routed voice agents for communities and agent teams.

Thesis

Duplex benchmarks and routes across the best realtime voice stacks, then lets teams deploy the right one for the use case.

The provider is not the product

Realtime voice is becoming a crowded layer: OpenAI Realtime, Google Gemini Live, PersonaPlex/Moshi-style self-hosted models, Pipecat pipelines, ElevenLabs, Hume EVI, Ultravox, and whatever launches next quarter. A voice agent platform that binds itself to one stack risks falling behind as soon as latency, price, tool reliability, or voice quality shifts in a competitor's favor.

Duplex should treat providers as interchangeable candidates behind one product promise: a full-duplex voice front desk that understands intent, routes users to the right human or agent, preserves context, and logs what happened.
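One way to picture "interchangeable candidates" is a small interface that every stack implements, plus a registry that maps each deployment to whichever candidate fits it. The sketch below is illustrative only: `VoiceProvider`, `StubProvider`, `ProviderRegistry`, and the `start_session` method are hypothetical names, not a Duplex API or any vendor SDK.

```python
from dataclasses import dataclass
from typing import Dict, Protocol


class VoiceProvider(Protocol):
    """Hypothetical minimal surface each candidate stack would implement."""

    name: str

    def start_session(self, deployment: str) -> str:
        """Open a voice session and return an opaque session id."""
        ...


@dataclass
class StubProvider:
    """Stand-in adapter; a real one would wrap a vendor SDK or a self-hosted model."""

    name: str

    def start_session(self, deployment: str) -> str:
        return f"{self.name}:{deployment}"


class ProviderRegistry:
    """Maps each deployment to one registered provider."""

    def __init__(self) -> None:
        self._providers: Dict[str, VoiceProvider] = {}
        self._assignment: Dict[str, str] = {}

    def register(self, provider: VoiceProvider) -> None:
        self._providers[provider.name] = provider

    def assign(self, deployment: str, provider_name: str) -> None:
        if provider_name not in self._providers:
            raise KeyError(f"unknown provider: {provider_name}")
        self._assignment[deployment] = provider_name

    def start(self, deployment: str) -> str:
        # Raises KeyError if the deployment has no assigned provider.
        provider = self._providers[self._assignment[deployment]]
        return provider.start_session(deployment)
```

Swapping a provider for one deployment then becomes a single `assign` call, and the rest of the product never sees the change.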

The real benchmark is the receptionist loop

Most model demos ask whether the model can hold a conversation. Duplex needs a harder test: can the voice layer handle the receptionist loop?

A user should be able to say “I want Ops,” interrupt mid-answer, ask to go back to the receptionist, switch to Research, then route to Builder with a clean handoff summary. If the provider cannot preserve that loop under latency, interruption, and transcript pressure, it is not ready for Duplex’s core use case.
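The loop above can be written down as a fixed script so that every candidate faces the same turns. This is a minimal sketch under stated assumptions: the `Turn` type, the route labels, and `run_script` are hypothetical, and `keyword_router` is a toy stand-in for a real provider adapter.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Turn:
    utterance: str            # what the user says
    expected_route: str       # where the session should land after this turn
    interrupts: bool = False  # True if spoken while the agent is still talking


# The receptionist loop as a fixed, replayable script (wording is illustrative).
RECEPTIONIST_LOOP: List[Turn] = [
    Turn("I want Ops", "ops"),
    Turn("Actually, take me back to the receptionist", "receptionist", interrupts=True),
    Turn("Switch me to Research", "research"),
    Turn("Send this to Builder with a summary", "builder"),
]


def run_script(route_fn: Callable[[str, str], str], script: List[Turn]) -> List[Turn]:
    """Replay the script and return the turns that were routed incorrectly."""
    current = "receptionist"
    failures: List[Turn] = []
    for turn in script:
        current = route_fn(turn.utterance, current)
        if current != turn.expected_route:
            failures.append(turn)
    return failures


def keyword_router(utterance: str, current: str) -> str:
    """Toy router: jump to the first agent name mentioned, else stay put."""
    text = utterance.lower()
    for name in ("receptionist", "ops", "research", "builder"):
        if name in text:
            return name
    return current


if __name__ == "__main__":
    print(run_script(keyword_router, RECEPTIONIST_LOOP))
```

A real harness would replace `keyword_router` with an adapter that drives audio through the provider and reads back the route it chose.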

Why this matters for agent teams

Modern agent systems already have the ingredients Duplex can leverage: profiles, tools, skills, memory, MCP servers, cron jobs, webhooks, gateway channels, and approval policies. What they do not naturally have is a realtime voice layer that feels like a live front desk.

The benchmark should therefore measure more than raw speech quality. It should measure whether a provider can help Duplex route voice into existing agent infrastructure safely and repeatably.

A practical provider strategy

OpenAI Realtime is the strongest default demo path today because it gives Duplex a high-quality managed WebRTC session with good reasoning and interruption behavior. PersonaPlex is the self-hosted candidate for low-cost/private deployments. Gemini Live is the multimodal candidate when voice needs screen or visual context. Pipecat is the orchestration layer for swapping STT, LLM, TTS, and transport components. ElevenLabs, Hume, and Ultravox belong in the candidate bench for voice quality, empathic tone, and open speech-native exploration.

That mix lets Duplex sell flexibility without sounding vague: choose the best provider per deployment, not per marketing cycle.

What Duplex should publish

A blog gives Duplex a public research surface. Each post can compare candidates against the same receptionist routing script, publish latency and failure notes, and explain which provider fits which domain: Discord communities, e-commerce support, creator groups, DevOps triage, or autonomous agent teams.

This creates credibility and turns the playground into a living benchmark instead of a static demo page.

Benchmark rubric

Barge-in / interruption

Why: Full-duplex should let users redirect the conversation without waiting for a turn to end.

How: Run the same mid-answer interruption script and score stop time, recovery quality, and route accuracy.
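As a rough sketch of how those three scores could be computed from logged trials: stop time is taken here as the gap between the user starting to speak and agent playback going silent. The `BargeInTrial` fields and the `summarize` helper are assumptions for illustration, not an existing Duplex schema.

```python
from dataclasses import dataclass
from statistics import median
from typing import Dict, List


@dataclass(frozen=True)
class BargeInTrial:
    user_speech_start_ms: float  # when the interrupting user started talking
    agent_audio_stop_ms: float   # when agent playback actually went silent
    recovered: bool              # did the agent address the new request?
    routed_correctly: bool       # did the session end at the expected agent?


def stop_time_ms(trial: BargeInTrial) -> float:
    """How long the agent kept talking over the user."""
    return trial.agent_audio_stop_ms - trial.user_speech_start_ms


def summarize(trials: List[BargeInTrial]) -> Dict[str, float]:
    """Aggregate stop time, recovery rate, and route accuracy over all trials."""
    if not trials:
        raise ValueError("need at least one trial")
    n = len(trials)
    return {
        "median_stop_ms": median(stop_time_ms(t) for t in trials),
        "recovery_rate": sum(t.recovered for t in trials) / n,
        "route_accuracy": sum(t.routed_correctly for t in trials) / n,
    }
```

The median is used rather than the mean so that one stalled session does not dominate the stop-time score; a percentile such as p95 would be a reasonable addition.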

Routing accuracy

Why: The receptionist has to identify when to answer directly, route to a specialist, or return to home base.

How: Use a fixed set of routing prompts across agent-team, Discord, e-commerce, and DevOps scenarios.
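A simple way to report this is accuracy per scenario, so a provider that does well on one domain but poorly on another is not hidden behind a single average. The sketch assumes results arrive as `(scenario, expected_route, actual_route)` tuples; that shape and the function name are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# (scenario, expected_route, actual_route): an assumed log format.
RoutingResult = Tuple[str, str, str]


def accuracy_by_scenario(results: Iterable[RoutingResult]) -> Dict[str, float]:
    """Fraction of prompts routed to the expected destination, per scenario."""
    totals: Dict[str, int] = defaultdict(int)
    correct: Dict[str, int] = defaultdict(int)
    for scenario, expected, actual in results:
        totals[scenario] += 1
        if expected == actual:
            correct[scenario] += 1
    return {scenario: correct[scenario] / totals[scenario] for scenario in totals}
```

The same fixed prompt set should be replayed against every candidate so the per-scenario numbers are comparable.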

Handoff summary quality

Why: Users should not repeat themselves after switching agents.

How: Compare summaries for goal, facts, open questions, sensitive data, and permission state.
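The five fields above map naturally onto a small record that can be checked against a hand-written reference summary. This is a crude sketch: `HandoffSummary` and `compare` are hypothetical, matching is exact after lowercasing (a real rubric would likely need human or model grading), and `sensitive_data` is assumed to mean the items the summary flags as sensitive.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Set


@dataclass
class HandoffSummary:
    goal: str
    facts: List[str] = field(default_factory=list)
    open_questions: List[str] = field(default_factory=list)
    sensitive_data: List[str] = field(default_factory=list)  # items flagged as sensitive
    permission_state: str = "unknown"


def _norm(items: Iterable[str]) -> Set[str]:
    return {item.strip().lower() for item in items}


def _recall(reference: Iterable[str], candidate: Iterable[str]) -> float:
    ref = _norm(reference)
    return len(ref & _norm(candidate)) / len(ref) if ref else 1.0


def compare(reference: HandoffSummary, candidate: HandoffSummary) -> Dict[str, object]:
    """Score a provider's summary against a reference, field by field."""
    return {
        "goal_match": reference.goal.strip().lower() == candidate.goal.strip().lower(),
        "fact_recall": _recall(reference.facts, candidate.facts),
        "open_question_recall": _recall(reference.open_questions, candidate.open_questions),
        # Sensitive items the reference flags but the candidate failed to flag.
        "unflagged_sensitive": sorted(
            _norm(reference.sensitive_data) - _norm(candidate.sensitive_data)
        ),
        "permission_match": reference.permission_state == candidate.permission_state,
    }
```

A low `fact_recall` is the direct signal that the user would have to repeat themselves after the switch.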

Tool-call readiness

Why: Duplex must eventually trigger workflow tools, not just chat.

How: Measure whether the provider can support confirmation gates and structured action events.
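To make "confirmation gates and structured action events" concrete, here is a minimal sketch of a gate that refuses to run a tool until the user has confirmed. `ActionEvent`, `dispatch`, and the status strings are assumptions for illustration; no provider is known to emit events in exactly this shape.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict


@dataclass
class ActionEvent:
    """Hypothetical structured event a voice layer would emit instead of free text."""

    tool: str
    arguments: Dict[str, Any]
    requires_confirmation: bool


def dispatch(
    event: ActionEvent,
    confirmed: bool,
    execute: Callable[[str, Dict[str, Any]], Any],
) -> Dict[str, Any]:
    """Run the tool only if it needs no confirmation or the user has confirmed."""
    if event.requires_confirmation and not confirmed:
        # Nothing is executed; the caller should ask the user out loud.
        return {"status": "pending_confirmation", "tool": event.tool}
    result = execute(event.tool, event.arguments)
    return {"status": "executed", "tool": event.tool, "result": result}
```

A benchmark run could count how often a provider produces a well-formed event and how often it tries to act before the gate has been passed.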

Cost per useful minute

Why: Raw audio minute pricing hides retries, latency, and failed sessions.

How: Normalize cost by successful routed conversation, not by generated seconds.
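One interpretation of that normalization is sketched below: total spend, including failed sessions, is divided by the number of successfully routed conversations and by the minutes spent in them. The `Session` fields and metric names are hypothetical, and any figures fed in are illustrative rather than measured prices.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class Session:
    cost_usd: float            # everything billed for this session, retries included
    duration_s: float          # wall-clock length of the session
    routed_successfully: bool  # did the user reach the right human or agent?


def cost_metrics(sessions: Iterable[Session]) -> Dict[str, Optional[float]]:
    """Compare raw per-minute cost with cost normalized by useful outcomes."""
    sessions = list(sessions)
    total_cost = sum(s.cost_usd for s in sessions)
    total_minutes = sum(s.duration_s for s in sessions) / 60
    useful = [s for s in sessions if s.routed_successfully]
    useful_minutes = sum(s.duration_s for s in useful) / 60
    return {
        # What a price sheet would suggest.
        "raw_cost_per_minute": total_cost / total_minutes if total_minutes else None,
        # Failed sessions still cost money, so they stay in the numerator.
        "cost_per_routed_conversation": total_cost / len(useful) if useful else None,
        "cost_per_useful_minute": total_cost / useful_minutes if useful_minutes else None,
    }
```

Because failed sessions stay in the numerator but drop out of the denominator, the useful-minute figure can never be lower than the raw per-minute figure, and the gap between the two is the cost of unreliability.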

Deployment control

Why: Some customers need managed quality, others need self-hosting or data control.

How: Score API dependency, self-hosting path, observability, and operational burden.

Official benchmark

See the realtime voice stack results.

Review the public Duplex scorecard for PersonaPlex, OpenAI Realtime, Gemini Live, Pipecat, and other state-of-the-art candidates.
