Realtime voice stack results for Duplex receptionist routing.
The public scorecard for choosing the right realtime voice provider per deployment: managed, multimodal, self-hosted, or pipeline-based.
OpenAI Realtime
Recommended default for the first public receptionist/router benchmark.
Strong default for a public Duplex demo: the path from browser microphone to realtime session is already viable, interruption behavior is good, and reasoning quality is high enough for routing decisions.
Test script
Start at Receptionist, route to Ops, interrupt mid-answer, return to Receptionist, then transfer to Research or Builder with context.
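The script above can be written down as an expected route path and checked against the routes a session actually visited. This is a minimal sketch, not the benchmark harness itself: the `EXPECTED_PREFIX` list, the `follows_script` helper, and the event format (a plain list of agent names) are hypothetical. The mid-answer interruption is modeled only as the point where Ops hands back to Receptionist.

```python
# Hypothetical sketch: the agent names come from the test script;
# the data shapes and helper name are assumptions for illustration.
EXPECTED_PREFIX = ["Receptionist", "Ops", "Receptionist"]
FINAL_SPECIALISTS = {"Research", "Builder"}


def follows_script(visited: list[str]) -> bool:
    """Return True if the visited routes match the scripted path.

    Scripted path: Receptionist -> Ops -> (interrupt) -> Receptionist
    -> Research or Builder.
    """
    if len(visited) != len(EXPECTED_PREFIX) + 1:
        return False
    return visited[:-1] == EXPECTED_PREFIX and visited[-1] in FINAL_SPECIALISTS
```

A run that ends at Research or Builder after returning to Receptionist passes; a run that skips the return to Receptionist, or ends at any other agent, fails.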
Measured outcome
The provider wins only if the route changes are fast, understandable, auditable, and useful to the next specialist.
Safety lens
External actions require confirmation gates. Voice can request work, but approval policy controls execution.
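One way to picture the confirmation gate is a wrapper that refuses to run an external action unless an approval callback says yes. The sketch below is illustrative only: `ActionRequest`, `run_action`, and the `confirm` / `execute` callbacks are hypothetical names, and a real approval policy would likely be richer than a single boolean.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ActionRequest:
    name: str        # hypothetical action name, e.g. "send_email"
    external: bool   # True if the action has effects outside the session


def run_action(
    request: ActionRequest,
    confirm: Callable[[ActionRequest], bool],
    execute: Callable[[ActionRequest], str],
) -> str:
    """Run the action only if the approval policy allows it.

    Voice may request any action, but external actions execute only
    after the confirm callback approves them.
    """
    if request.external and not confirm(request):
        return "blocked: confirmation required"
    return execute(request)
```

The voice layer only builds the `ActionRequest`; the decision to execute stays with the `confirm` callback, so the approval policy can change without touching the voice provider.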
Benchmark rubric
Interruption / barge-in
Can the user cut off the receptionist and redirect the route naturally?
Routing accuracy
Does the stack preserve intent when switching from Receptionist to a specialist?
Handoff quality
Does the next agent receive a useful summary without making the user repeat themselves?
Tool-call readiness
Can the voice layer safely trigger workflow tools behind confirmation gates?
Cost per useful minute
What does a successful routed session cost after retries, latency, and failures?
Deployment control
Can the customer run managed, self-hosted, or hybrid depending on the use case?
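The "cost per useful minute" criterion can be made concrete with a small calculation. The scorecard does not define a formula, so the sketch below assumes one: total spend across all attempts (failed and retried ones included) divided by the minutes of sessions that ended in a successful route. `SessionRecord` and `cost_per_useful_minute` are hypothetical names, and latency is not modeled separately.

```python
from dataclasses import dataclass


@dataclass
class SessionRecord:
    cost_usd: float   # provider spend for this attempt, retries included
    minutes: float    # session duration in minutes
    routed_ok: bool   # True if the session ended in a successful route


def cost_per_useful_minute(sessions: list[SessionRecord]) -> float:
    """Assumed definition: all spend divided by successful-session minutes."""
    total_cost = sum(s.cost_usd for s in sessions)
    useful_minutes = sum(s.minutes for s in sessions if s.routed_ok)
    if useful_minutes == 0:
        # No successful sessions: every dollar was wasted.
        return float("inf")
    return total_cost / useful_minutes
```

Under this assumed definition, failed attempts raise the figure because their cost counts but their minutes do not. For example, two successful sessions costing 0.60 and 0.90 USD over 2 and 3 minutes, plus one failed 0.30 USD attempt, give 1.80 USD over 5 useful minutes, or about 0.36 USD per useful minute.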
OpenAI Realtime
High-quality managed WebRTC demos, receptionist routing, and fast product validation.
Recommended default for the first public receptionist/router benchmark.
Caveat: Managed API dependency and provider pricing mean it is not the ideal choice for every long-running community session.
Google Gemini Live
Voice sessions that need screen, image, or multimodal context.
Best benchmark lane for multimodal receptionist scenarios.
Caveat: Needs separate validation for Discord-style multi-speaker rooms and production observability.
PersonaPlex / Moshi-style
Lower-cost, private, or self-hosted deployments where infrastructure control matters.
Best strategic candidate for self-hosted/private tiers once routing quality is proven.
Caveat: Requires more integration work around reasoning, tool calls, observability, and hosted reliability.
Pipecat pipeline
Benchmarking and swapping STT, LLM, TTS, transport, and observability components.
Best candidate for the official benchmark harness and adapter layer.
Caveat: Quality depends on the underlying STT, LLM, TTS, and transport choices.
ElevenLabs Conversational AI
Voice quality, brand voice, and customer-facing concierge experiences.
Best candidate for polished voice and brand-sensitive demos.
Caveat: Needs validation for advanced multi-agent routing, custom tooling, and cost per useful minute.
Hume EVI
Emotion-aware support, coaching, wellness, and sensitive conversational experiences.
Best candidate for emotion-aware routing tests.
Caveat: Needs proof against operational routing tasks and structured action handoffs.
Ultravox
Open speech-native experimentation and lower-level control.
Best candidate for watching open speech-native progress.
Caveat: Requires production validation for reliability, latency, and integration maturity.