Real-time Speaker Diarization at Scale
July 20, 2025
Engineering
Figure 1: Separating audio streams in real-time
The "Cocktail Party" Problem
Humans are great at focusing on one voice in a noisy room. Computers struggle.
In a mono-channel VoIP call (common in legacy telephony), the Agent and Customer are mixed into one stream.
To build an AI Agent that knows when to interrupt, you must know who is talking.
The Architecture
1. Frame-Level Embedding
We slice audio into 500ms windows and pass them through a lightweight encoder (SpeechBrain ECAPA-TDNN).
Output: A 192-dimensional vector describing the "timbre" of that slice.
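In practice, the per-window embedding step looks roughly like the sketch below. It assumes 16 kHz mono audio (so a 500ms window is 8,000 samples) and SpeechBrain's pretrained spkrec-ecapa-voxceleb checkpoint; the exact import path varies slightly across SpeechBrain releases.

```python
import torch
from speechbrain.pretrained import EncoderClassifier  # speechbrain.inference.speaker in newer releases

# Pretrained ECAPA-TDNN speaker encoder (192-dim embeddings).
encoder = EncoderClassifier.from_hparams(source="speechbrain/spkrec-ecapa-voxceleb")

def embed_window(window: torch.Tensor) -> torch.Tensor:
    """Embed one 500ms mono window sampled at 16 kHz (shape: [1, 8000])."""
    with torch.no_grad():
        emb = encoder.encode_batch(window)  # -> [1, 1, 192]
    return emb.squeeze()                    # -> [192]

# Example: a single 500ms slice.
window = torch.zeros(1, 8000)
print(embed_window(window).shape)  # torch.Size([192])
```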
2. Hybrid Architecture
We moved away from purely in-app diarization (Pyannote) to a Hybrid Approach:
- External Diarization: We ingest pre-diarized segments from high-throughput APIs (such as Google Speech-to-Text or OpenAI).
- Internal Verification: We run our 192-dim ECAPA-TDNN verification locally to "double-check" and bind identities to known voice profiles (see the sketch below).
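A minimal sketch of that verification step: compare a segment's embedding against enrolled voice profiles with cosine similarity and bind it to the closest identity. The profile file paths and the 0.6 threshold here are illustrative assumptions, not production values.

```python
import numpy as np

# Hypothetical enrolled voice profiles: identity -> reference 192-dim embedding
# (e.g. averaged over a few seconds of enrollment audio).
profiles = {
    "agent_x": np.load("profiles/agent_x.npy"),
    "customer_123": np.load("profiles/customer_123.npy"),
}

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def verify_segment(segment_embedding: np.ndarray, threshold: float = 0.6) -> str:
    """Bind an externally diarized segment to the closest enrolled identity.

    The 0.6 threshold is illustrative; the real cutoff should be tuned on held-out calls.
    """
    name, score = max(
        ((n, cosine(segment_embedding, ref)) for n, ref in profiles.items()),
        key=lambda pair: pair[1],
    )
    return name if score >= threshold else "unknown"
```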
3. Latency Constraints
Cloud APIs handle the "Who spoke when?" map.
IdentityCall handles the "Is this Agent X?" verification.
- Budget: 50ms processing time per frame for verification.
- Optimization: We run the encoder in ONNX Runtime on GPU, quantized to INT8.
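As a rough illustration of how the latency budget is checked, the sketch below loads an ONNX model into a GPU-backed ONNX Runtime session and times one verification frame. The model file name and input layout are assumptions; only the onnxruntime API calls themselves are standard.

```python
import time
import numpy as np
import onnxruntime as ort

# Hypothetical ONNX export of the INT8-quantized ECAPA-TDNN encoder;
# the file name and input shape are assumptions.
session = ort.InferenceSession(
    "ecapa_tdnn_int8.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
input_name = session.get_inputs()[0].name

frame = np.random.randn(1, 8000).astype(np.float32)  # one 500ms frame at 16 kHz

start = time.perf_counter()
embedding = session.run(None, {input_name: frame})[0]
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"verification latency: {elapsed_ms:.1f} ms (budget: 50 ms)")
```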
Handling Overlap
The hardest part is overlapping speech, which makes up roughly 10-15% of a typical call.
Standard single-label models simply assign each frame to the "loudest" speaker.
We use Multi-Label Diarization, assigning two speaker labels to a single timeframe if the embedding suggests a mix.
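One minimal way to express that multi-label decision, assuming frame embeddings are compared against the same enrolled profiles as in the verification sketch above (the 0.5 threshold is again only illustrative):

```python
import numpy as np

def assign_speakers(embedding: np.ndarray, profiles: dict[str, np.ndarray],
                    threshold: float = 0.5) -> list[str]:
    """Return every enrolled speaker whose profile matches this frame.

    A frame that clears the threshold for two profiles is treated as overlapping
    speech and gets both labels instead of only the strongest one.
    """
    labels = []
    for name, ref in profiles.items():
        score = float(np.dot(embedding, ref) /
                      (np.linalg.norm(embedding) * np.linalg.norm(ref)))
        if score >= threshold:
            labels.append(name)
    return labels or ["unknown"]
```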
Conclusion
Good diarization is the prerequisite for good transcription. If you attribute the Agent's "Hello" to the Customer, the entire conversation context is corrupted from the very first turn.