Real-time Speaker Diarization at Scale
July 20, 2025
Engineering
Figure 1: Separating audio streams in real-time
The "Cocktail Party" Problem
Humans are great at focusing on one voice in a noisy room. Computers struggle.
In a mono-channel VoIP call (common in legacy telephony), the Agent and Customer are mixed into one stream.
To build an AI Agent that knows when to interrupt, you must know who is talking.
The Architecture
1. Frame-Level Embedding
We slice audio into 500ms windows and pass them through a lightweight encoder (SpeechBrain ECAPA-TDNN).
Output: A 192-dimensional vector describing the "timbre" of that slice.
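In practice, the per-window embedding step looks roughly like the sketch below. It assumes 16 kHz mono audio (so a 500ms window is 8,000 samples) and SpeechBrain's pretrained spkrec-ecapa-voxceleb checkpoint; the exact import path varies slightly across SpeechBrain releases.

```python
import torch
from speechbrain.pretrained import EncoderClassifier  # speechbrain.inference.speaker in newer releases

# Pretrained ECAPA-TDNN speaker encoder (192-dim embeddings).
encoder = EncoderClassifier.from_hparams(source="speechbrain/spkrec-ecapa-voxceleb")

def embed_window(window: torch.Tensor) -> torch.Tensor:
    """Embed one 500ms mono window sampled at 16 kHz (shape: [1, 8000])."""
    with torch.no_grad():
        emb = encoder.encode_batch(window)  # -> [1, 1, 192]
    return emb.squeeze()                    # -> [192]

# Example: a single 500ms slice.
window = torch.zeros(1, 8000)
print(embed_window(window).shape)  # torch.Size([192])
```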
2. Hybrid Architecture
We moved away from purely in-app diarization (Pyannote) to a Hybrid Approach:
- External Diarization: We ingest pre-diarized segments from high-throughput APIs (such as Google Speech-to-Text or OpenAI).
- Internal Verification: We run our 192-dim ECAPA-TDNN verification locally to "double-check" and bind identities to known voice profiles (see the sketch below).
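A minimal sketch of that verification step: compare a segment's embedding against enrolled voice profiles with cosine similarity and bind it to the closest identity. The profile file paths and the 0.6 threshold here are illustrative assumptions, not production values.

```python
import numpy as np

# Hypothetical enrolled voice profiles: identity -> reference 192-dim embedding
# (e.g. averaged over a few seconds of enrollment audio).
profiles = {
    "agent_x": np.load("profiles/agent_x.npy"),
    "customer_123": np.load("profiles/customer_123.npy"),
}

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def verify_segment(segment_embedding: np.ndarray, threshold: float = 0.6) -> str:
    """Bind an externally diarized segment to the closest enrolled identity.

    The 0.6 threshold is illustrative; the real cutoff should be tuned on held-out calls.
    """
    name, score = max(
        ((n, cosine(segment_embedding, ref)) for n, ref in profiles.items()),
        key=lambda pair: pair[1],
    )
    return name if score >= threshold else "unknown"
```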
3. Latency Constraints
Cloud APIs handle the "Who spoke when?" map.
IdentityCall handles the "Is this Agent X?" verification.
- Budget: 50ms processing time per frame for verification.
- Optimization: We run the encoder in ONNX Runtime on GPU, quantized to INT8.
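As a rough illustration of how the latency budget is checked, the sketch below loads an ONNX model into a GPU-backed ONNX Runtime session and times one verification frame. The model file name and input layout are assumptions; only the onnxruntime API calls themselves are standard.

```python
import time
import numpy as np
import onnxruntime as ort

# Hypothetical ONNX export of the INT8-quantized ECAPA-TDNN encoder;
# the file name and input shape are assumptions.
session = ort.InferenceSession(
    "ecapa_tdnn_int8.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
input_name = session.get_inputs()[0].name

frame = np.random.randn(1, 8000).astype(np.float32)  # one 500ms frame at 16 kHz

start = time.perf_counter()
embedding = session.run(None, {input_name: frame})[0]
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"verification latency: {elapsed_ms:.1f} ms (budget: 50 ms)")
```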
Handling Overlap
The hardest part is overlapping speech, which makes up roughly 10-15% of a typical call.
Standard single-label models simply assign each frame to the "loudest" speaker.
We use Multi-Label Diarization, assigning two speaker labels to a single timeframe if the embedding suggests a mix.
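One minimal way to express that multi-label decision, assuming frame embeddings are compared against the same enrolled profiles as in the verification sketch above (the 0.5 threshold is again only illustrative):

```python
import numpy as np

def assign_speakers(embedding: np.ndarray, profiles: dict[str, np.ndarray],
                    threshold: float = 0.5) -> list[str]:
    """Return every enrolled speaker whose profile matches this frame.

    A frame that clears the threshold for two profiles is treated as overlapping
    speech and gets both labels instead of only the strongest one.
    """
    labels = []
    for name, ref in profiles.items():
        score = float(np.dot(embedding, ref) /
                      (np.linalg.norm(embedding) * np.linalg.norm(ref)))
        if score >= threshold:
            labels.append(name)
    return labels or ["unknown"]
```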
Conclusion
Good diarization is the prerequisite for good transcription. If you attribute the Agent's "Hello" to the Customer, the entire conversation context is corrupted from the very first turn.