Measuring and Optimizing Latency in AI Voice Calls
A practical latency guide: where delay comes from in duplex AI voice, what to measure, and optimizations that preserve natural conversation.
Measuring and Optimizing Latency in AI Voice Calls (Asterisk + VoiceBridge)
In AI voice systems, response latency decides whether a conversation feels natural. A delay of 200–300 ms feels natural, 800 ms feels robotic, and beyond 1.2 seconds the conversation collapses into awkward turn-taking.
When connecting Asterisk / FreePBX to AI through MYLINEHUB VoiceBridge, latency is not a single number — it is the sum of multiple micro-delays across:
- RTP packetization and jitter buffering
- ARI event handling
- Audio encoding/decoding
- Network travel time
- AI STT processing
- LLM inference
- TTS synthesis
- RTP re-injection timing
This article explains how to measure latency correctly, where it originates, and how VoiceBridge architecture minimizes it.
Architecture reference: MYLINEHUB VoiceBridge Architecture
Open-source project: mylinehub-voicebridge (GitHub)
Understanding End-to-End Voice Latency
End-to-end AI voice latency can be summarized as:
Total latency = network + buffering + AI inference + audio regeneration.
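The sum above can be tracked as a per-stage budget. The following is a minimal sketch, not part of VoiceBridge: the class `LatencyBudget`, the stage names, and every millisecond figure are illustrative assumptions chosen only to show the arithmetic.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Sums per-stage delays into one end-to-end figure. Hypothetical helper, not VoiceBridge code. */
final class LatencyBudget {
    private final Map<String, Long> stagesMs = new LinkedHashMap<>();

    /** Adds a stage delay in milliseconds; repeated stage names accumulate. */
    LatencyBudget add(String stage, long millis) {
        if (millis < 0) {
            throw new IllegalArgumentException("negative delay for " + stage);
        }
        stagesMs.merge(stage, millis, Long::sum);
        return this;
    }

    /** Total latency = sum of all recorded stages. */
    long totalMs() {
        long total = 0;
        for (long ms : stagesMs.values()) {
            total += ms;
        }
        return total;
    }

    /** Name of the stage that contributes the most delay, or null if empty. */
    String largestStage() {
        String largest = null;
        long max = -1;
        for (Map.Entry<String, Long> e : stagesMs.entrySet()) {
            if (e.getValue() > max) {
                max = e.getValue();
                largest = e.getKey();
            }
        }
        return largest;
    }

    public static void main(String[] args) {
        // Illustrative numbers only; measure your own deployment.
        LatencyBudget budget = new LatencyBudget()
                .add("network", 30)
                .add("buffering", 60)
                .add("ai-inference", 350)
                .add("audio-regeneration", 120);
        System.out.println("total = " + budget.totalMs() + " ms, largest = " + budget.largestStage());
    }
}
```

Keeping the stages separate matters more than the total: it shows which term to attack first.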
Latency Sources in Detail
1. RTP Frame Duration
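Telephony RTP typically uses 20 ms frames (for G.711, 160 bytes of payload per packet). Larger frames increase latency because the sender must collect a full frame of audio before it can transmit it.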
Packet handling in:
rtp/RtpPacketizer.java
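The frame-size arithmetic can be sketched as follows. This assumes G.711 at 8 kHz with one byte per sample; the class `RtpFrameMath` is a hypothetical helper for illustration and is not taken from `rtp/RtpPacketizer.java`.

```java
/** Frame-size arithmetic for G.711 (8 kHz, 8-bit samples). Hypothetical helper. */
final class RtpFrameMath {
    static final int G711_SAMPLE_RATE_HZ = 8000;
    static final int G711_BYTES_PER_SAMPLE = 1;

    /** Number of audio samples carried by one frame of the given duration. */
    static int samplesPerFrame(int frameMs) {
        return G711_SAMPLE_RATE_HZ * frameMs / 1000;
    }

    /** RTP payload size in bytes (excluding the 12-byte RTP header). */
    static int payloadBytes(int frameMs) {
        return samplesPerFrame(frameMs) * G711_BYTES_PER_SAMPLE;
    }

    /** Packets sent per second at the given frame duration. */
    static int packetsPerSecond(int frameMs) {
        return 1000 / frameMs;
    }

    public static void main(String[] args) {
        for (int frameMs : new int[] {10, 20, 40}) {
            System.out.println(frameMs + " ms -> " + payloadBytes(frameMs)
                    + " bytes/packet, " + packetsPerSecond(frameMs) + " packets/s");
        }
    }
}
```

The trade-off is visible in the numbers: a 40 ms frame halves the packet rate but adds 20 ms of collection delay to every packet.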
2. Jitter Buffer Delay
The Asterisk jitter buffer can add 40–120 ms, depending on its configuration.
3. STT Processing Delay
Streaming STT reduces delay versus batch transcription. VoiceBridge is designed for streaming audio input.
4. LLM Inference Time
Response time depends on how quickly the model produces its first token and on token generation speed after that. Use smaller models for ultra-low-latency use cases.
5. TTS Generation
Streaming TTS significantly reduces playback wait time.
How VoiceBridge Minimizes Latency
Symmetric RTP Endpoint
Implemented in:
rtp/RtpSymmetricEndpoint.java
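Sending and receiving RTP on the same UDP port (symmetric RTP) lets media traverse NAT reliably, which avoids the delay caused by a media path that fails or has to be re-learned.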
Efficient RTP Packetizer
rtp/RtpPacketizer.java ensures:
- Monotonic timestamps
- Consistent 20ms pacing
- Minimal buffering
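The three properties above can be illustrated with a small pacing sketch. It is an assumption about how such pacing might be done, not the actual contents of `rtp/RtpPacketizer.java`: the class `FramePacer` and its methods are hypothetical. Deadlines are derived from a fixed start time so that small sleep errors do not accumulate, and the sequence number and timestamp wrap at 16 and 32 bits as they do on the wire.

```java
/** Pacing state for 20 ms G.711 frames. Hypothetical sketch, not VoiceBridge code. */
final class FramePacer {
    private static final int SAMPLES_PER_FRAME = 160;   // 20 ms at 8 kHz
    private static final long FRAME_NANOS = 20_000_000L; // 20 ms

    private final long startNanos;
    private int sequence;     // 16-bit RTP sequence number
    private long timestamp;   // 32-bit RTP timestamp, in sample units
    private long framesSent;

    FramePacer(long startNanos, int initialSequence, long initialTimestamp) {
        this.startNanos = startNanos;
        this.sequence = initialSequence & 0xFFFF;
        this.timestamp = initialTimestamp & 0xFFFFFFFFL;
    }

    /** Absolute send deadline for the next frame, computed from the start time to avoid drift. */
    long nextDeadlineNanos() {
        return startNanos + framesSent * FRAME_NANOS;
    }

    /** Returns {sequence, timestamp} for the frame being sent, then advances the state. */
    long[] advance() {
        long[] used = {sequence, timestamp};
        sequence = (sequence + 1) & 0xFFFF;
        timestamp = (timestamp + SAMPLES_PER_FRAME) & 0xFFFFFFFFL;
        framesSent++;
        return used;
    }

    public static void main(String[] args) throws InterruptedException {
        FramePacer pacer = new FramePacer(System.nanoTime(), 0, 0);
        for (int i = 0; i < 5; i++) {
            long waitNanos = pacer.nextDeadlineNanos() - System.nanoTime();
            if (waitNanos > 0) {
                Thread.sleep(waitNanos / 1_000_000, (int) (waitNanos % 1_000_000));
            }
            long[] header = pacer.advance(); // a real sender would write the packet here
            System.out.println("seq=" + header[0] + " ts=" + header[1]);
        }
    }
}
```

Sleeping a fixed 20 ms per iteration would instead let scheduling jitter build up over a long call, which the receiver sees as drift.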
Direct ARI Event Handling
ARI control layer:
ari/impl/AriBridgeImpl.java
Reduces bridge creation delay and media negotiation overhead.
Measuring Latency Correctly
Method 1 — Waveform Echo Test
Play a click tone into the call, record the audio, and measure the time from the click to the onset of the AI response in the waveform.
Method 2 — RTP Timestamp Comparison
Use Wireshark to:
- Capture inbound RTP
- Capture outbound RTP
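- Compare packet capture times between the end of the caller's speech and the first outbound response packet (the RTP timestamp field itself starts at a random offset per stream, so it cannot be compared across the two streams)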
Method 3 — Application Logging
Add timing logs around:
- STT request start and final transcript
- LLM request start and completion
- TTS request start and generation complete
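One way to collect these timings is a small stage timer around the pipeline calls. This is a sketch under assumptions: `StageTimer` is a hypothetical class, the stage names are illustrative, and the sleeps in `main` stand in for real STT, LLM, and TTS calls. The clock is injected so the timer can be tested without waiting; in production it would be `System::nanoTime`.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.LongSupplier;

/** Records elapsed milliseconds between successive marks. Hypothetical helper. */
final class StageTimer {
    private final LongSupplier nanoClock;
    private final Map<String, Long> elapsedMs = new LinkedHashMap<>();
    private long lastMark;

    StageTimer(LongSupplier nanoClock) {
        this.nanoClock = nanoClock;
        this.lastMark = nanoClock.getAsLong();
    }

    /** Stores the time since the previous mark (or construction) under the stage name. */
    void mark(String stage) {
        long now = nanoClock.getAsLong();
        elapsedMs.put(stage, (now - lastMark) / 1_000_000);
        lastMark = now;
    }

    Map<String, Long> elapsedMs() {
        return elapsedMs;
    }

    public static void main(String[] args) throws InterruptedException {
        StageTimer timer = new StageTimer(System::nanoTime);
        Thread.sleep(15);   // placeholder for the STT call
        timer.mark("stt");
        Thread.sleep(30);   // placeholder for the LLM call
        timer.mark("llm");
        Thread.sleep(10);   // placeholder for the TTS call
        timer.mark("tts");
        System.out.println(timer.elapsedMs());
    }
}
```

`System.nanoTime()` is used rather than wall-clock time because it is monotonic and unaffected by clock adjustments.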
Latency Targets for Natural Conversation
| Total Delay | Conversation Quality |
|---|---|
| < 300ms | Feels real-time |
| 300–600ms | Acceptable |
| 600–1000ms | Noticeable delay |
| > 1000ms | Breaks natural flow |
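The bands in the table can be applied to measured totals with a simple classifier. This is a sketch: the class `ConversationQuality` is hypothetical, and because the table does not say which band the exact boundary values belong to, the code assumes 300 ms and 600 ms start the next band and 1000 ms still counts as "Noticeable delay".

```java
/** Maps a measured total delay to the quality bands in the table. Hypothetical helper. */
final class ConversationQuality {
    static String classify(long totalDelayMs) {
        if (totalDelayMs < 0) {
            throw new IllegalArgumentException("delay must be non-negative");
        }
        if (totalDelayMs < 300) {
            return "Feels real-time";
        }
        if (totalDelayMs < 600) {
            return "Acceptable";
        }
        if (totalDelayMs <= 1000) { // boundary handling is an assumption
            return "Noticeable delay";
        }
        return "Breaks natural flow";
    }

    public static void main(String[] args) {
        for (long ms : new long[] {250, 560, 800, 1200}) {
            System.out.println(ms + " ms: " + classify(ms));
        }
    }
}
```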
Optimization Checklist
- Use 20ms RTP frames
- Keep VoiceBridge close to Asterisk (same LAN if possible)
- Use streaming STT + streaming TTS
- Minimize model size for latency-critical use cases
- Disable unnecessary logging in production
- Avoid unnecessary transcoding
Cloud vs On-Prem Latency
Hosting VoiceBridge on-prem:
- Reduces RTP travel time
- Improves jitter stability
Cloud deployment:
- Adds network round-trip
- Requires careful region selection
Conclusion
AI voice quality is determined less by “model intelligence” and more by media engineering discipline.
VoiceBridge was built with RTP correctness and duplex timing control as first-class design principles — not afterthoughts.