How OpenAI delivers low-latency voice AI at scale


TLDR

  • OpenAI rebuilt its WebRTC stack using a stateless relay plus stateful transceiver architecture to achieve low-latency, globally scaled voice AI without the one-port-per-session problem.

Key Takeaways

  • The core problem: one-port-per-session WebRTC requires massive public UDP port ranges that break Kubernetes autoscaling, cloud load balancers, and firewall policy.
  • The solution is a split architecture: a lightweight UDP relay handles public-facing packet forwarding, while a Go/Pion transceiver owns all ICE, DTLS, and SRTP state and connects to inference backends.
  • First-packet routing uses the ICE ufrag, which is encoded with enough metadata for the relay to identify the owning transceiver without an external lookup service.
  • The relay maintains only a minimal in-memory forwarding session; if it restarts, the next STUN packet rebuilds routing from the ufrag hint automatically.
  • The transceiver model was chosen over an SFU (selective forwarding unit) because most sessions are 1:1 user-to-model, so SFU-style multiparty fan-out is unnecessary overhead.
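The routing scheme above can be sketched in Go. This is a minimal illustration, not the actual implementation: the ufrag layout (`<transceiverID>.<random>`), the `Relay` type, and the ID-to-address resolver are all hypothetical stand-ins for whatever encoding OpenAI uses. It shows the two properties the article describes: a warm path served from a tiny in-memory forwarding map, and a cold path (e.g. after a relay restart) that rebuilds routing purely from the metadata embedded in the ufrag.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// Relay holds the only state the public-facing hop needs: a small
// in-memory map from client address to transceiver address. If the
// relay restarts, the map is empty and routing is recovered from the
// ufrag on the next inbound STUN packet.
type Relay struct {
	mu       sync.Mutex
	sessions map[string]string // client addr -> transceiver addr
	// resolve maps a transceiver ID (decoded from the ufrag) to its
	// address; a stand-in for a deterministic, lookup-free scheme.
	resolve func(transceiverID string) (string, bool)
}

// Route returns the transceiver address for a packet from clientAddr
// carrying the given ICE ufrag.
func (r *Relay) Route(clientAddr, ufrag string) (string, error) {
	r.mu.Lock()
	defer r.mu.Unlock()

	// Warm path: forwarding state already in memory.
	if dst, ok := r.sessions[clientAddr]; ok {
		return dst, nil
	}

	// Cold path: recover the owning transceiver from metadata
	// embedded in the ufrag itself, with no external lookup service.
	// Hypothetical layout: "<transceiverID>.<random>".
	id, _, ok := strings.Cut(ufrag, ".")
	if !ok {
		return "", fmt.Errorf("malformed ufrag %q", ufrag)
	}
	dst, ok := r.resolve(id)
	if !ok {
		return "", fmt.Errorf("unknown transceiver %q", id)
	}
	r.sessions[clientAddr] = dst // rebuild the forwarding session
	return dst, nil
}

func main() {
	relay := &Relay{
		sessions: map[string]string{},
		resolve: func(id string) (string, bool) {
			// Illustrative fixed mapping; addresses are invented.
			m := map[string]string{"tx42": "10.0.0.42:9000"}
			a, ok := m[id]
			return a, ok
		},
	}
	dst, err := relay.Route("203.0.113.5:51000", "tx42.abc123")
	if err != nil {
		panic(err)
	}
	fmt.Println(dst) // 10.0.0.42:9000
}
```

Because the cold path needs nothing but the packet itself, the relay stays horizontally scalable and disposable, which is what makes it compatible with Kubernetes autoscaling behind a single public port.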

Hacker News Comment Review

  • Commenters drew a sharp line between transport latency (what this article solves) and VAD/turn-detection latency, with some arguing accurate voice activity detection matters more to conversational feel than WebRTC tuning.
  • Users noted that aggressive barge-in behavior frustrates speakers who pause mid-thought, pointing to turn-taking logic as the more user-facing problem OpenAI has not fully solved.
  • RFC 9297 (HTTP Datagrams) and alternatives like Pipecat were raised as paths that could sidestep WebRTC complexity entirely for client-server voice AI scenarios.

Notable Comments

  • @Sean-Der: Pion creator, now at OpenAI, flagged WebRTC for the Curious as a practical entry point for engineers new to the stack.
  • @hnav: RFC 9297 browser support would remove the need for WebRTC in client-server voice scenarios entirely.
