Official benchmark

Realtime voice stack results for Duplex receptionist routing.

The public scorecard for choosing the right realtime voice provider per deployment: managed, multimodal, self-hosted, or pipeline-based.

Current leader

OpenAI Realtime

Recommended default for the first public receptionist/router benchmark.

Strong default for a public Duplex demo: the path from browser microphone to realtime session is already viable, interruption handling is good, and reasoning quality is high enough for routing decisions.

Composite score

7.7/10

Barge-in: 9/10

Routing: 9/10

Handoff: 8/10

Tools: 8/10

Cost: 6/10

Control: 6/10
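
The composite appears to be the unweighted mean of the six sub-scores, rounded to one decimal place. That is an inference from the published numbers (46 / 6 ≈ 7.7), not a documented formula. A minimal Python sketch under that equal-weights assumption, with a hypothetical function name:

```python
# Sub-scores for OpenAI Realtime, copied from the scorecard above.
OPENAI_REALTIME = {
    "barge_in": 9,
    "routing": 9,
    "handoff": 8,
    "tools": 8,
    "cost": 6,
    "control": 6,
}


def composite_score(sub_scores: dict[str, int]) -> float:
    """Unweighted mean of the sub-scores, rounded to one decimal.

    Assumption: equal weights. The scorecard does not state its formula.
    """
    values = list(sub_scores.values())
    return round(sum(values) / len(values), 1)


print(composite_score(OPENAI_REALTIME))  # 7.7
```

If the benchmark later weights some dimensions more heavily (for example, routing over cost), only the body of `composite_score` would need to change.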

Test script

Start at Receptionist, route to Ops, interrupt mid-answer, return to Receptionist, then transfer to Research or Builder with context.
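
The script can be written down as an ordered list of expected route changes so that every provider runs the same sequence. The sketch below is one hypothetical encoding in Python. The `Step` type, its field names, the reading that the interruption happens during the Ops answer, and the choice of Research as the final target are illustrative assumptions, not part of the benchmark harness.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """One expected route change in the scripted session (hypothetical shape)."""
    source: str
    target: str
    interrupts: bool = False       # user barges in while the agent is speaking
    carries_context: bool = False  # next agent should receive a handoff summary


# Receptionist -> Ops, interrupt mid-answer and return, then transfer with context.
TEST_SCRIPT = [
    Step("Receptionist", "Ops"),
    Step("Ops", "Receptionist", interrupts=True),
    Step("Receptionist", "Research", carries_context=True),  # or "Builder"
]


def is_connected(script: list[Step]) -> bool:
    """Check that each step starts where the previous one ended."""
    return all(a.target == b.source for a, b in zip(script, script[1:]))


print(is_connected(TEST_SCRIPT))  # True
```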

Measured outcome

The provider wins only if the route changes are fast, understandable, auditable, and useful to the next specialist.

Safety lens

External actions require confirmation gates. Voice can request work, but approval policy controls execution.
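
One way to read this policy: the voice layer can only propose an action, and a separate approval step decides whether it runs. The following Python sketch illustrates that split. `ProposedAction`, `execute_if_approved`, and the `approve` callback are hypothetical names, not an API of any provider listed here.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ProposedAction:
    """An action requested by voice (hypothetical shape)."""
    name: str
    external: bool  # True if the action has side effects outside the session


def execute_if_approved(
    action: ProposedAction,
    run: Callable[[ProposedAction], str],
    approve: Callable[[ProposedAction], bool],
) -> str:
    """Run the action only if policy allows it.

    External actions always go through `approve`; internal ones run directly.
    """
    if action.external and not approve(action):
        return f"blocked: {action.name}"
    return run(action)


def deny_all(action: ProposedAction) -> bool:
    """Example policy: nothing external runs until a human confirms."""
    return False


def run_action(action: ProposedAction) -> str:
    """Stand-in for the real workflow tool."""
    return f"executed: {action.name}"


print(execute_if_approved(ProposedAction("send_email", external=True), run_action, deny_all))
# blocked: send_email
print(execute_if_approved(ProposedAction("summarize_call", external=False), run_action, deny_all))
# executed: summarize_call
```

Keeping `approve` outside the voice layer means the same gate can be reused regardless of which provider handles the audio.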

Benchmark rubric

Interruption / barge-in

Can the user cut off the receptionist and redirect the route naturally?

Routing accuracy

Does the stack preserve intent when switching from Receptionist to a specialist?

Handoff quality

Does the next agent receive a useful summary without making the user repeat themselves?

Tool-call readiness

Can the voice layer safely trigger workflow tools behind confirmation gates?

Cost per useful minute

What does a successful routed session cost after retries, latency, and failures?

Deployment control

Can the customer run managed, self-hosted, or hybrid depending on the use case?
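
The rubric does not give a formula for cost per useful minute. One plausible reading, sketched below in Python, divides total spend across all attempts (including failed and retried sessions) by the minutes of sessions that routed successfully. The `Session` fields and the sample figures are made up for illustration and are not benchmark data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Session:
    """One benchmark attempt (hypothetical shape)."""
    minutes: float
    cost_usd: float
    routed_ok: bool  # True if the session reached the right specialist


def cost_per_useful_minute(sessions: list[Session]) -> float:
    """Total cost of all attempts divided by minutes of successful sessions."""
    total_cost = sum(s.cost_usd for s in sessions)
    useful_minutes = sum(s.minutes for s in sessions if s.routed_ok)
    if useful_minutes == 0:
        raise ValueError("no successful sessions to measure against")
    return total_cost / useful_minutes


# Illustrative numbers only.
sessions = [
    Session(minutes=4.0, cost_usd=0.50, routed_ok=True),
    Session(minutes=1.0, cost_usd=0.25, routed_ok=False),  # failed attempt is still billed
    Session(minutes=6.0, cost_usd=0.75, routed_ok=True),
]
print(cost_per_useful_minute(sessions))  # 0.15
```

Counting failed attempts in the numerator but not the denominator is what makes an unreliable provider look more expensive than its list price suggests.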

Live demo path: Managed speech-to-speech model

OpenAI Realtime

High-quality managed WebRTC demos, receptionist routing, and fast product validation.

Recommended default for the first public receptionist/router benchmark.

Result snapshot

7.7/10

Barge-in: 9

Routing: 9

Handoff: 8

Tools: 8

Cost: 6

Control: 6

Caveat: Managed API dependency and provider pricing mean it is not ideal for every long-running community session.

Candidate: Managed multimodal realtime model

Google Gemini Live

Voice sessions that need screen, image, or multimodal context.

Best benchmark lane for multimodal receptionist scenarios.

Result snapshot

7.2/10

Barge-in: 8

Routing: 8

Handoff: 8

Tools: 7

Cost: 6

Control: 6

Caveat: Needs separate validation for Discord-style multi-speaker rooms and production observability.

Candidate: Self-hosted full-duplex candidate

PersonaPlex / Moshi-style

Lower-cost, private, or self-hosted deployments where infrastructure control matters.

Best strategic candidate for self-hosted/private tiers once routing quality is proven.

Result snapshot

7.2/10

Barge-in: 8

Routing: 6

Handoff: 6

Tools: 5

Cost: 9

Control: 9

Caveat: Requires more integration work around reasoning, tool calls, observability, and hosted reliability.

Pipeline layer: Provider-agnostic realtime pipeline

Pipecat pipeline

Benchmarking and swapping STT, LLM, TTS, transport, and observability components.

Best candidate for the official benchmark harness and adapter layer.

Result snapshot

7.5/10

Barge-in: 7

Routing: 7

Handoff: 7

Tools: 8

Cost: 7

Control: 9

Caveat: Quality depends on the underlying STT, LLM, TTS, and transport choices.

Candidate: Managed voice agent platform

ElevenLabs Conversational AI

Voice quality, brand voice, and customer-facing concierge experiences.

Best candidate for polished voice and brand-sensitive demos.

Result snapshot

6.2/10

Barge-in: 7

Routing: 7

Handoff: 6

Tools: 6

Cost: 6

Control: 5

Caveat: Needs validation for advanced multi-agent routing, custom tooling, and cost per useful minute.

Candidate: Managed empathic voice interface

Hume EVI

Emotion-aware support, coaching, wellness, and sensitive conversational experiences.

Best candidate for emotion-aware routing tests.

Result snapshot

6.0/10

Barge-in: 7

Routing: 6

Handoff: 7

Tools: 5

Cost: 6

Control: 5

Caveat: Needs proof against operational routing tasks and structured action handoffs.

Candidate: Speech-native model candidate

Ultravox

Open speech-native experimentation and lower-level control.

Best candidate for watching open speech-native progress.

Result snapshot

6.8/10

Barge-in: 7

Routing: 6

Handoff: 6

Tools: 6

Cost: 8

Control: 8

Caveat: Requires production validation for reliability, latency, and integration maturity.