Why Most “AI Voice for Asterisk” Solutions Fail at Full Duplex
Most integrations break on duplex: RTP direction, timing drift, NAT, barge-in, and audio injection. Learn why—and how VoiceBridge avoids the traps.
Asterisk makes it look easy to build an “AI voice bot”: answer a call, record audio, run STT → LLM → TTS, then play audio back. In demos, this works. In production, it breaks — especially the moment you want true full duplex: the caller can speak while the bot is speaking, interruptions (“barge-in”) work, and the audio still stays stable under NAT, jitter, and real carrier RTP behavior.
This article explains the real reasons most integrations fail, and what a production-grade duplex bridge must do differently. We use the open-source reference implementation from MYLINEHUB VoiceBridge to show what “real” looks like: https://github.com/mylinehub/omnichannel-crm/tree/main/mylinehub-voicebridge.
If you want the full VoiceBridge architecture overview first, read: https://mylinehub.com/articles/mylinehub-voicebridge-architecture.
The “Demo Trap”: Why Full Duplex Looks Easy Until You Ship
The demo version of an AI call usually runs in a lab:
- single server
- no NAT (or predictable NAT)
- clean LAN RTP
- one call at a time
- audio files or slow turn-based conversation
Full duplex dies in production because the system stops being “a script” and becomes a distributed real-time media pipeline:
- RTP is continuous: typically one packet every 20 ms for G.711, and 10 ms or 20 ms frames for PCM16 depending on the processing pipeline
- two independent audio directions must remain stable (caller→bot, bot→caller)
- the bot must stop talking the instant the caller interrupts (barge-in)
- timing must be correct even under jitter and packet loss
- ports, NAT, firewall policy, and session cleanup must work at scale
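To make the "continuous" part concrete, the frame arithmetic can be sketched in a few lines of Python. This is only an illustration: the function name `frame_stats` is hypothetical, and the 8 kHz sample rate is the usual narrowband telephony assumption rather than something taken from VoiceBridge.

```python
def frame_stats(sample_rate_hz: int, frame_ms: int, bytes_per_sample: int) -> dict:
    """Return per-frame sizes and packet rate for a fixed frame duration."""
    samples = sample_rate_hz * frame_ms // 1000
    return {
        "samples_per_frame": samples,
        "bytes_per_frame": samples * bytes_per_sample,
        "packets_per_second": 1000 // frame_ms,
    }


# G.711 carries one byte per sample; PCM16 carries two.
g711 = frame_stats(8000, 20, 1)
pcm16 = frame_stats(8000, 20, 2)
```

With 20 ms frames, each direction of each call produces 50 packets per second, so a duplex call is 100 packets per second that all have to arrive on schedule.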
Most solutions fail because they solve “AI” but not “telecom media engineering”.
Failure #1: Turn-Based Architectures Pretend to Be “Voice AI”
The most common approach is still: record → transcribe → think → synthesize → play. This is not full duplex. It is an IVR with AI content.
Common examples:
- AGI scripts
- AMI-triggered audio playback flows
- Dialplan macros that run external commands
These fail at full duplex because they are structurally blocking:
- While Asterisk is playing your TTS file, your “bot” is not listening continuously.
- While you are recording caller audio, the bot is not talking.
- Interruptions become impossible (or extremely delayed), because audio is file-based and buffered.
VoiceBridge includes an explicit minimal AGI example mainly to show that AGI is not enough for duplex: MinimalAgiExample.java. True duplex requires RTP streaming + ARI media graph control, not turn-based blocking flows.
Failure #2: “We Receive RTP” Is Not the Same as “We Can Send RTP Back Correctly”
Many projects claim “full duplex” because they successfully receive RTP from Asterisk via ARI ExternalMedia. Receiving RTP is the easy half.
The hard half is sending RTP back to Asterisk in a way Asterisk will accept as real-time audio:
- correct payload type for the negotiated codec
- correct RTP timestamp increment per packet
- correct sequence increment
- correct pacing (20ms is not “sometimes 10ms then 80ms”)
- stable SSRC behavior (or correct handling when SSRC changes)
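The header fields listed above can be sketched as follows. This is a minimal Python illustration of the fixed 12-byte RTP header from RFC 3550, not VoiceBridge's Java packetizer; the class name `SimpleRtpPacketizer` and its parameters are hypothetical, and it assumes no CSRC list, no header extension, and no padding.

```python
import struct


class SimpleRtpPacketizer:
    """Builds RTP packets with a fixed 12-byte header (RFC 3550)."""

    def __init__(self, ssrc, payload_type=0, samples_per_packet=160,
                 seq=0, timestamp=0):
        self.ssrc = ssrc                      # kept constant for the stream
        self.payload_type = payload_type      # 0 = PCMU, 8 = PCMA (RFC 3551)
        self.samples_per_packet = samples_per_packet  # 160 = 20 ms at 8 kHz
        self.seq = seq & 0xFFFF
        self.timestamp = timestamp & 0xFFFFFFFF

    def packetize(self, payload: bytes, marker: bool = False) -> bytes:
        first_byte = 0x80                     # version 2, no padding/extension/CSRC
        second_byte = (0x80 if marker else 0) | (self.payload_type & 0x7F)
        header = struct.pack("!BBHII", first_byte, second_byte,
                             self.seq, self.timestamp, self.ssrc)
        # Sequence advances by one packet; timestamp advances by samples,
        # not by bytes or milliseconds. Both wrap at their field width.
        self.seq = (self.seq + 1) & 0xFFFF
        self.timestamp = (self.timestamp + self.samples_per_packet) & 0xFFFFFFFF
        return header + payload
```

The timestamp step is the detail most often gotten wrong: for 20 ms of 8 kHz audio it is 160 per packet regardless of how many bytes the codec produces.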
Most “samples” inject audio like generic UDP data, with naive sleeps, or bursty timing. The result:
- robotic audio
- late packets discarded
- drift (your timestamps run away from wall-clock time)
- one direction works, the other direction is silent
In VoiceBridge, the RTP correctness layer is explicit:
- rtp/RtpPacketizer.java — constructs RTP packets with correct headers (sequence, timestamp, payload type, SSRC discipline).
- queue/PlayoutScheduler.java — schedules outbound audio so packets are paced like real telephony, not burst output.
- dsp/Pcm10msFramer.java — frames audio into consistent durations so timing math stays stable.
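One way to avoid the naive-sleep problem is to schedule every packet against an absolute deadline instead of sleeping a fixed interval after each send. The sketch below is a Python illustration of that idea only; `paced_send` and its parameters are hypothetical names, and it says nothing about how VoiceBridge's scheduler is actually written.

```python
import time


def paced_send(frames, send, frame_s=0.020,
               clock=time.monotonic, sleep=time.sleep):
    """Send frames on a fixed grid: frame n goes out at start + n * frame_s.

    Sleeping a constant 20 ms after each send would add the send/processing
    time to every interval, so the stream slowly drifts late. Deadlines
    computed from the start time absorb that cost instead.
    """
    start = clock()
    for n, frame in enumerate(frames):
        delay = (start + n * frame_s) - clock()
        if delay > 0:
            sleep(delay)
        send(frame)  # if we are already late, send immediately and catch up
```

The `clock` and `sleep` parameters exist so the timing logic can be exercised with a fake clock; in real use the defaults apply.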
Failure #3: NAT and Firewall Reality Destroys “One UDP Port” Designs
Real deployments involve NAT, stateful firewalls, and asymmetric routing. Asterisk might send RTP from a different source port than you expect, especially through NAT. Many “AI voice” solutions hardcode RTP endpoints:
- “Send audio to the IP/port from the ARI response.”
- “Bind to a UDP port and hope the packets come back.”
Under symmetric NAT or strict firewall policy, the inbound RTP source may differ from the advertised tuple. This yields classic symptoms:
- you can hear caller audio, but the caller cannot hear the bot
- or audio works for 10–30 seconds then goes silent (NAT mapping changes)
- or duplex works on LAN but fails on WAN
VoiceBridge explicitly handles this by learning and locking the symmetric RTP endpoint:
- rtp/RtpSymmetricEndpoint.java — tracks the actual (IP, port) observed on inbound RTP and uses it for outbound RTP safely.
This is the difference between “works in a demo” and “works behind real routers”.
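The learn-and-lock behavior can be sketched roughly like this. It is a Python illustration under stated assumptions, not the logic of RtpSymmetricEndpoint.java: the class name `LearnedRtpEndpoint` is hypothetical, and the re-learn policy (switch only after several consecutive packets from a new source) is an assumption chosen to show one way of surviving a NAT rebinding without following a single stray packet.

```python
class LearnedRtpEndpoint:
    """Tracks where inbound RTP really comes from and replies there."""

    def __init__(self, advertised, relearn_after=5):
        self.advertised = advertised      # (ip, port) from signaling/ARI
        self.locked = None                # (ip, port) observed on the wire
        self.relearn_after = relearn_after
        self._candidate = None
        self._candidate_hits = 0

    def on_inbound(self, source):
        """Call with the UDP source address of every inbound RTP packet."""
        if self.locked is None:
            self.locked = source          # first packet: learn and lock
        elif source == self.locked:
            self._candidate, self._candidate_hits = None, 0
        else:
            if source == self._candidate:
                self._candidate_hits += 1
            else:
                self._candidate, self._candidate_hits = source, 1
            if self._candidate_hits >= self.relearn_after:
                self.locked = source      # sustained change: NAT rebinding
                self._candidate, self._candidate_hits = None, 0

    def outbound_target(self):
        """Where to send bot audio: the learned source, else the advertised one."""
        return self.locked or self.advertised
```

A hardened version would also validate the sender (for example by SSRC or an address allowlist) before accepting a new source, since blindly following packets can let a third party redirect the audio.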
Failure #4: Incorrect ARI Media Graph (Bridge/Snoop/ExternalMedia Misuse)
Full duplex inside Asterisk is not “attach ExternalMedia and done”. The Asterisk media graph matters:
- which channels are in which bridge
- whether the bridge is mixing vs holding
- whether you need snoop channels for read-only capture
- how you avoid echo and talk/listen collisions
A common mistake is building a single bridge and assuming it will behave like a duplex media router. In practice, the most stable pattern is usually:
- one leg dedicated to capture (caller → AI)
- one leg dedicated to injection (AI → caller)
- explicit lifecycle control during call start/stop
VoiceBridge’s ARI orchestration is implemented in:
- ari/impl/AriBridgeImpl.java — constructs and manages the bridge topology required for duplex stability.
- ari/impl/ExternalMediaManagerImpl.java — creates ExternalMedia channels with the correct parameters, and ensures they are attached correctly.
- ari/AriWsClient.java — event-driven ARI control is required; polling-only approaches miss lifecycle timing.
If your ARI media graph is wrong, you will see “half duplex” behavior, echo-like artifacts, or silent legs.
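For orientation, the ARI REST calls that a single ExternalMedia leg is assembled from can be sketched as a plan of requests. This Python sketch only builds the request descriptions and makes no network calls. The function name, the ID naming scheme, and the choice of `ulaw` are assumptions for illustration; the endpoints and parameter names are those of the Asterisk ARI bridges and channels resources. It shows the building blocks for one leg (which may be the caller channel or a snoop channel), not the topology VoiceBridge builds.

```python
def plan_external_media_leg(app, channel_id, leg, external_host,
                            audio_format="ulaw"):
    """Return the ordered ARI requests (method, path, params) for one leg."""
    bridge_id = f"{leg}-bridge-{channel_id}"   # hypothetical naming scheme
    media_id = f"{leg}-media-{channel_id}"
    return [
        # 1. A mixing bridge to connect the two channels.
        ("POST", "/ari/bridges",
         {"type": "mixing", "bridgeId": bridge_id, "name": bridge_id}),
        # 2. An ExternalMedia channel that exchanges RTP with external_host.
        ("POST", "/ari/channels/externalMedia",
         {"app": app, "channelId": media_id, "external_host": external_host,
          "format": audio_format, "encapsulation": "rtp", "transport": "udp"}),
        # 3. Put the source channel and the media channel in the bridge.
        ("POST", f"/ari/bridges/{bridge_id}/addChannel",
         {"channel": f"{channel_id},{media_id}"}),
    ]
```

Using predictable IDs per call and per leg makes teardown simpler, because the bridge and the media channel can be deleted by name even if the creating code path failed halfway.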
Failure #5: Codec Handling and “Invisible” Transcoding Costs
Many teams start with “PCM16 everywhere” because it’s easiest for AI models. But the caller leg is usually telephony codecs:
- PCMU/PCMA (G.711) for SIP trunks
- Opus for WebRTC endpoints
The moment you mix:
- PCMU inbound from Asterisk
- PCM16 required by AI STT
- PCM16 returned by AI TTS
- PCMU required outbound back to Asterisk
…you are now doing real-time transcoding. Most “AI voice” scripts do it incorrectly or too slowly:
- wrong sample rate assumptions
- wrong frame boundaries (causes clicking / robotic sound)
- buffering too much (latency explosion)
- CPU spikes under concurrency
VoiceBridge has explicit codec and framing components:
- audio/CodecFactory.java — chooses codec pipelines for a session.
- audio/codec/PcMuCodec.java, audio/codec/PcMaCodec.java, audio/codec/OpusCodec.java — real codec implementations.
- audio/resampler/FfmpegResampler.java — pragmatic resampling/transcoding when needed (with operational tradeoffs).
If you do not treat codec conversion as a first-class engineering concern, duplex fails under load (or sounds broken).
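As an illustration of what "real conversion" means for the PCMU case, here is a small Python sketch of the widely used G.711 μ-law segment algorithm. It is not the code in PcMuCodec.java; the function names are hypothetical, and it assumes 16-bit signed little-endian PCM at the same sample rate as the μ-law stream (no resampling).

```python
import struct

BIAS = 0x84    # 132, added before finding the segment
CLIP = 32635   # largest magnitude that still fits after adding BIAS


def linear_to_ulaw(sample: int) -> int:
    """Encode one 16-bit signed PCM sample as one mu-law byte."""
    sign = 0x80 if sample < 0 else 0
    magnitude = min(abs(sample), CLIP) + BIAS
    exponent, mask = 7, 0x4000
    while exponent > 0 and not (magnitude & mask):
        exponent -= 1
        mask >>= 1
    mantissa = (magnitude >> (exponent + 3)) & 0x0F
    return ~(sign | (exponent << 4) | mantissa) & 0xFF


def ulaw_to_linear(byte: int) -> int:
    """Decode one mu-law byte to a 16-bit signed PCM sample."""
    byte = ~byte & 0xFF
    exponent = (byte >> 4) & 0x07
    mantissa = byte & 0x0F
    magnitude = (((mantissa << 3) + BIAS) << exponent) - BIAS
    return -magnitude if byte & 0x80 else magnitude


def pcm16le_to_ulaw(pcm: bytes) -> bytes:
    count = len(pcm) // 2
    samples = struct.unpack(f"<{count}h", pcm[:count * 2])
    return bytes(linear_to_ulaw(s) for s in samples)


def ulaw_to_pcm16le(data: bytes) -> bytes:
    return struct.pack(f"<{len(data)}h", *(ulaw_to_linear(b) for b in data))
```

A 20 ms PCM16 frame at 8 kHz is 320 bytes and encodes to exactly 160 μ-law bytes; if those numbers do not line up in a pipeline, the sample rate or frame boundary assumption is wrong somewhere upstream.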
Failure #6: No Real Barge-In (Interruptions) Control
Full duplex is not just “two audio directions”. It also means:
- the bot may be speaking
- the caller starts speaking
- the bot must stop speaking instantly
- and the AI must receive the caller speech without being contaminated by the bot audio
Most systems fail here because they:
- play long TTS audio without a stop mechanism
- buffer too much audio in advance
- have no “truncate” semantics toward the AI streaming engine
VoiceBridge implements barge-in as a control loop:
- barge/BargeInController.java — decides when caller speech energy should cut through.
- barge/AudioEnergy.java — tracks audio energy / thresholds for interruption detection.
- ai/impl/OpenAiRealtimeTruncateManager.java — sends truncation signals to stop AI audio mid-stream cleanly.
Without this, you can’t deliver human-like conversational flow — you deliver “please wait for the bot to finish”.
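The detection half of that control loop can be sketched with a simple energy gate. This Python example is an assumption-laden illustration, not the logic in BargeInController.java or AudioEnergy.java: the class name, the RMS threshold of 1000, and the "three consecutive loud frames" rule are all hypothetical values that would need tuning against real calls.

```python
import math
import struct


def rms_pcm16le(frame: bytes) -> float:
    """Root-mean-square level of a 16-bit little-endian PCM frame."""
    count = len(frame) // 2
    if count == 0:
        return 0.0
    samples = struct.unpack(f"<{count}h", frame[:count * 2])
    return math.sqrt(sum(s * s for s in samples) / count)


class BargeInDetector:
    """Signals an interruption after N consecutive loud caller frames."""

    def __init__(self, threshold_rms=1000.0, frames_required=3):
        self.threshold_rms = threshold_rms
        self.frames_required = frames_required
        self._loud_frames = 0

    def on_caller_frame(self, frame: bytes, bot_speaking: bool) -> bool:
        """Return True exactly when the bot should be cut off."""
        if not bot_speaking:
            self._loud_frames = 0
            return False
        if rms_pcm16le(frame) >= self.threshold_rms:
            self._loud_frames += 1
        else:
            self._loud_frames = 0          # a quiet frame resets the run
        if self._loud_frames >= self.frames_required:
            self._loud_frames = 0
            return True
        return False
```

With 20 ms frames, three consecutive frames means roughly 60 ms of detection delay. When the detector fires, the caller of this code still has to do the other half: flush queued bot audio and tell the AI engine to truncate its response.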
Failure #7: No Session State Model (So Cleanup and Race Conditions Kill You)
Media systems are state machines. Calls end abruptly. Channels hang up. Bridges get destroyed. If your integration is “some scripts + UDP”, then over time you accumulate:
- orphaned UDP sockets
- port leaks (RTP ports never returned)
- dangling ARI channels/bridges
- threads blocked on I/O
Under concurrency, race conditions become common:
- AI returns TTS after call already ended
- you send RTP to a port now assigned to a different call
- ARI sends an event after your script “thinks” it’s done
VoiceBridge treats the call as a first-class session object:
- session/CallSession.java — holds the per-call state (RTP endpoints, codec, AI config, bridges, lifecycle flags).
- session/CallSessionManager.java — creates, tracks, and reliably cleans up sessions on hangup/failure.
- rtp/RtpPortAllocator.java — manages RTP ports safely so concurrency does not corrupt calls.
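The port-leak and port-reuse problems above can be illustrated with a small allocator. This is a Python sketch under assumptions, not RtpPortAllocator.java: the class name `PortPool` is hypothetical, and handing out even ports and recycling them in FIFO order are design choices made here for illustration.

```python
import threading
from collections import deque


class PortPool:
    """Thread-safe pool of even UDP ports, keyed by call ID."""

    def __init__(self, first: int, last: int):
        # RTP conventionally uses even ports (the odd neighbor is for RTCP).
        self._free = deque(range(first + first % 2, last + 1, 2))
        self._owner = {}
        self._lock = threading.Lock()

    def allocate(self, call_id: str) -> int:
        with self._lock:
            if call_id in self._owner:          # idempotent per call
                return self._owner[call_id]
            if not self._free:
                raise RuntimeError("no RTP ports available")
            port = self._free.popleft()
            self._owner[call_id] = port
            return port

    def release(self, call_id: str) -> None:
        with self._lock:
            port = self._owner.pop(call_id, None)
            if port is not None:                # double release is a no-op
                self._free.append(port)         # FIFO: reuse as late as possible
```

Making `release` idempotent matters because hangup events, error handlers, and timeouts often all try to clean up the same call. Recycling ports in FIFO order delays reuse, which reduces the chance that late audio for an ended call lands on a port already assigned to a new one.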
Failure #8: “AI Integration” That Ignores Real-Time Constraints
Many AI integrations treat speech like files:
- buffer 5–10 seconds, then send for STT
- wait for full LLM response
- generate entire TTS audio first
- then play back
That approach guarantees an unnatural conversation. Full duplex requires streaming:
- stream caller audio continuously to STT/AI
- receive partial response quickly
- start speaking while AI is still generating
- be able to stop instantly on interruption
VoiceBridge’s real-time AI layer is explicitly separated and implemented as streaming clients:
- ai/RealtimeAiClient.java — contract for streaming interaction.
- ai/impl/RealtimeAiClientImpl.java — production logic for streaming send/receive.
- ai/AiClientFactory.java — per-session AI selection/config.
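On the playback side, streaming means the bridge has to accept AI audio in whatever chunk sizes arrive, cut it into fixed frames as soon as enough bytes exist, and be able to drop everything on interruption. A minimal Python sketch of that buffer follows; the class name `StreamingPlayout` is hypothetical, the 320-byte frame assumes 20 ms of 8 kHz PCM16, and none of it is taken from RealtimeAiClientImpl.java.

```python
from collections import deque


class StreamingPlayout:
    """Turns arbitrarily sized audio chunks into fixed-size frames."""

    def __init__(self, frame_bytes: int = 320):
        self.frame_bytes = frame_bytes
        self._pending = bytearray()   # bytes not yet forming a full frame
        self._frames = deque()        # complete frames waiting to be sent

    def feed(self, chunk: bytes) -> None:
        """Accept a partial AI audio chunk; frames become available at once."""
        self._pending.extend(chunk)
        while len(self._pending) >= self.frame_bytes:
            self._frames.append(bytes(self._pending[:self.frame_bytes]))
            del self._pending[:self.frame_bytes]

    def next_frame(self):
        """Return the next frame to packetize, or None if nothing is ready."""
        return self._frames.popleft() if self._frames else None

    def interrupt(self) -> int:
        """Drop all queued audio (barge-in); return how many frames were cut."""
        dropped = len(self._frames)
        self._frames.clear()
        self._pending.clear()
        return dropped
```

Playback can start as soon as the first 320 bytes arrive rather than after the full response, and `interrupt` bounds how much stale bot audio can still reach the caller after a barge-in.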
What Actually Works: The Minimum Requirements for True Duplex
If you want full duplex that survives production, your solution needs all of the following:
- Correct RTP packetization (headers + pacing + timestamps)
- Two-leg media model where capture and injection are controlled reliably
- Symmetric RTP endpoint learning for NAT correctness
- Codec discipline (real conversion, not assumptions)
- Streaming AI integration (not file-based turns)
- Barge-in controller (detect speech, truncate bot audio, resume listening)
- Session lifecycle state machine (cleanup, retries, failure handling)
- Scalability primitives (port allocation, concurrency, backpressure)
This is why “AI calling” isn’t a single feature — it’s a system.
Why VoiceBridge Doesn’t Fall Into These Traps
VoiceBridge was built specifically around the failure modes above. It is not “ARI + UDP”. It is a real media bridge with explicit modules for RTP correctness, NAT safety, barge-in, codec handling, and event-driven ARI session control.
If you want to explore the implementation, start with these core files:
- ARI + bridges: AriBridgeImpl.java, ExternalMediaManagerImpl.java
- RTP correctness + NAT: RtpPacketizer.java, RtpSymmetricEndpoint.java, RtpPortAllocator.java
- Session lifecycle: CallSession.java, CallSessionManager.java
- Barge-in + truncation: BargeInController.java, OpenAiRealtimeTruncateManager.java
Practical Checklist: How to Spot a “Fake Duplex” Solution
If someone claims they have “full duplex AI voice for Asterisk”, ask these questions:
- Can the caller interrupt the bot, and does the bot stop within 200 ms?
- Do they send outbound RTP with correct timestamps and stable pacing, or bursts?
- Do they handle symmetric NAT by learning the inbound RTP source endpoint?
- Do they support real concurrency with clean port allocation and cleanup?
- Do they have a session state model, or is it “scripts and sockets”?
- Can they show Wireshark traces proving both RTP directions are continuous and stable?
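When reviewing such traces, one objective number to ask for is interarrival jitter as defined in RFC 3550. The Python sketch below computes it from (arrival time, RTP timestamp) pairs exported from a capture. The function name and input format are assumptions for illustration, and for brevity it ignores RTP timestamp wraparound and packet reordering.

```python
def interarrival_jitter_ms(packets, clock_rate=8000):
    """RFC 3550 interarrival jitter, in milliseconds.

    packets: iterable of (arrival_time_seconds, rtp_timestamp) in arrival order.
    """
    jitter = 0.0
    previous = None
    for arrival_s, rtp_ts in packets:
        arrival_units = arrival_s * clock_rate   # same units as RTP timestamps
        if previous is not None:
            # D = difference in arrival spacing minus difference in send spacing
            d = (arrival_units - previous[0]) - (rtp_ts - previous[1])
            jitter += (abs(d) - jitter) / 16.0   # running estimate per RFC 3550
        previous = (arrival_units, rtp_ts)
    return jitter / clock_rate * 1000.0
```

A steadily paced stream keeps this value near zero, while a sender that emits packets in bursts shows up as several milliseconds of jitter even when no packets are lost.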
If the answers are vague, the solution is not production-grade duplex.
Conclusion
Most “AI voice for Asterisk” attempts fail at full duplex because they solve the AI part and ignore the telecom part. Real duplex requires engineering RTP correctness, NAT survival, bridge topology, codec conversion, barge-in control, and lifecycle state — all at once, under load.
The open-source path that addresses these realities directly is MYLINEHUB VoiceBridge. Explore the project here: https://github.com/mylinehub/omnichannel-crm/tree/main/mylinehub-voicebridge.