VoiceBridge

Why Most “AI Voice for Asterisk” Solutions Fail at Full Duplex

MYLINEHUB Team • 2026-02-12 • 12 min

Most integrations break on duplex: RTP direction, timing drift, NAT, barge-in, and audio injection. Learn why—and how VoiceBridge avoids the traps.

Asterisk makes it look easy to build an “AI voice bot”: answer a call, record audio, run STT → LLM → TTS, then play audio back. In demos, this works. In production, it breaks, especially once you want true full duplex: the caller can speak while the bot is speaking, interruptions (“barge-in”) work, and the audio stays stable under NAT, jitter, and real carrier RTP behavior.

This article explains the real reasons most integrations fail, and what a production-grade duplex bridge must do differently. We use the open-source reference implementation from MYLINEHUB VoiceBridge to show what “real” looks like: https://github.com/mylinehub/omnichannel-crm/tree/main/mylinehub-voicebridge.

If you want the full VoiceBridge architecture overview first, read: https://mylinehub.com/articles/mylinehub-voicebridge-architecture.

The “Demo Trap”: Why Full Duplex Looks Easy Until You Ship

The demo version of an AI call usually runs in a lab:

  • single server
  • no NAT (or predictable NAT)
  • clean LAN RTP
  • one call at a time
  • audio files or slow turn-based conversation

Full duplex dies in production because the system stops being “a script” and becomes a distributed real-time media pipeline:

  • RTP is continuous (typically one packet every 20 ms for G.711; PCM16 pipelines often work in 10 ms or 20 ms frames, depending on processing)
  • two independent audio directions must remain stable (caller→bot, bot→caller)
  • the bot must stop talking the instant the caller interrupts (barge-in)
  • timing must be correct even under jitter and packet loss
  • ports, NAT, firewall policy, and session cleanup must work at scale

Most solutions fail because they solve “AI” but not “telecom media engineering”.
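
To make the 20 ms cadence from the list above concrete, here is a small arithmetic sketch. The class and method names are illustrative only and do not come from VoiceBridge; the figures assume mono audio.

```java
/** Frame-size arithmetic for fixed-duration telephony audio frames (illustrative helper). */
final class FrameMath {
    private FrameMath() {}

    /** Samples in one frame, e.g. 8000 Hz * 20 ms / 1000 = 160 samples. */
    static int samplesPerFrame(int sampleRateHz, int frameMillis) {
        return sampleRateHz * frameMillis / 1000;
    }

    /** Payload bytes in one mono frame for the given sample width. */
    static int bytesPerFrame(int sampleRateHz, int frameMillis, int bytesPerSample) {
        return samplesPerFrame(sampleRateHz, frameMillis) * bytesPerSample;
    }

    /** Packets per second, per direction, at a fixed frame duration. */
    static int packetsPerSecond(int frameMillis) {
        return 1000 / frameMillis;
    }
}
```

For G.711 at 8 kHz, a 20 ms frame is 160 samples and 160 payload bytes (one byte per sample). The same 20 ms of 16 kHz PCM16 is 320 samples and 640 bytes. Either way, each direction of each call needs 50 packets per second, delivered on time.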

Failure #1: Turn-Based Architectures Pretend to Be “Voice AI”

The most common approach is still: record → transcribe → think → synthesize → play. This is not full duplex. It is an IVR with AI content.

Common examples:

  • AGI scripts
  • AMI-triggered audio playback flows
  • Dialplan macros that run external commands

These fail at full duplex because they are structurally blocking:

  • While Asterisk is playing your TTS file, your “bot” is not listening continuously.
  • While you are recording caller audio, the bot is not talking.
  • Interruptions become impossible (or extremely delayed), because audio is file-based and buffered.

VoiceBridge includes an explicit minimal AGI example mainly to show that AGI is not enough for duplex: MinimalAgiExample.java. True duplex requires RTP streaming + ARI media graph control, not turn-based blocking flows.

Failure #2: “We Receive RTP” Is Not the Same as “We Can Send RTP Back Correctly”

Many projects claim “full duplex” because they successfully receive RTP from Asterisk via ARI ExternalMedia. Receiving RTP is the easy half.

The hard half is sending RTP back to Asterisk in a way Asterisk will accept as real-time audio:

  • correct payload type for the negotiated codec
  • correct RTP timestamp increment per packet
  • correct sequence increment
  • correct pacing (20ms is not “sometimes 10ms then 80ms”)
  • stable SSRC behavior (or correct handling when SSRC changes)
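
The header fields in the list above can be sketched as a minimal packetizer. This is not VoiceBridge's code: the class name `RtpPacketizer` and its constructor parameters are hypothetical, and the sketch assumes a fixed 12-byte RTP header with no CSRC entries or header extensions. Payload type 0 is the static assignment for PCMU.

```java
import java.nio.ByteBuffer;

/** Minimal RTP packetizer sketch: fixed 12-byte header, no CSRC list, no extensions. */
final class RtpPacketizer {
    private static final int HEADER_BYTES = 12;

    private final int payloadType;      // e.g. 0 for PCMU (static assignment)
    private final int ssrc;             // kept constant for the whole stream
    private final int samplesPerPacket; // e.g. 160 for 20 ms at 8 kHz
    private int sequence;               // 16-bit counter, wraps at 0xFFFF
    private int timestamp;              // 32-bit counter in sample units, wraps naturally
    private boolean firstPacket = true;

    RtpPacketizer(int payloadType, int ssrc, int samplesPerPacket,
                  int initialSequence, int initialTimestamp) {
        this.payloadType = payloadType & 0x7F;
        this.ssrc = ssrc;
        this.samplesPerPacket = samplesPerPacket;
        this.sequence = initialSequence & 0xFFFF;
        this.timestamp = initialTimestamp;
    }

    /** Wraps one encoded audio frame in an RTP header and advances the counters. */
    byte[] next(byte[] payload) {
        ByteBuffer buf = ByteBuffer.allocate(HEADER_BYTES + payload.length); // big-endian by default
        buf.put((byte) 0x80);                          // version 2, no padding, no extension, CC = 0
        int marker = firstPacket ? 0x80 : 0x00;        // marker bit set on the first packet of the talkspurt
        buf.put((byte) (marker | payloadType));
        buf.putShort((short) sequence);
        buf.putInt(timestamp);
        buf.putInt(ssrc);
        buf.put(payload);

        sequence = (sequence + 1) & 0xFFFF;            // +1 per packet
        timestamp += samplesPerPacket;                 // +samples per packet, not +milliseconds
        firstPacket = false;
        return buf.array();
    }
}
```

The timestamp advances by the number of samples in the packet (160 for 20 ms of 8 kHz audio), independent of when the packet is actually sent. Pacing is a separate concern.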

Most “samples” inject audio like generic UDP data, with naive sleeps, or bursty timing. The result:

  • robotic audio
  • late packets discarded
  • drift (your timestamps run away from wall-clock time)
  • one direction works, other direction is silent
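
A common cause of the drift and burstiness above is sleeping a fixed 20 ms after each send, which adds processing time and scheduler error to every cycle. One alternative, sketched below under the assumption of a single sender thread per stream, is to compute each deadline from the stream start so errors do not accumulate. `FramePacer` is a hypothetical name, not a VoiceBridge class.

```java
import java.util.concurrent.locks.LockSupport;

/** Drift-free pacing sketch: deadlines are absolute (start + n * frame), never "now + 20 ms". */
final class FramePacer {
    private final long startNanos;
    private final long frameNanos;
    private long framesSent;

    FramePacer(long startNanos, long frameMillis) {
        this.startNanos = startNanos;
        this.frameNanos = frameMillis * 1_000_000L;
    }

    /** Absolute deadline of the next frame on the System.nanoTime() clock. */
    long nextDeadlineNanos() {
        return startNanos + framesSent * frameNanos;
    }

    /** Blocks until the next frame is due, then advances the frame counter. */
    void awaitNextFrame() {
        long deadline = nextDeadlineNanos();
        long remaining;
        // parkNanos may return early, so re-check against the absolute deadline.
        while ((remaining = deadline - System.nanoTime()) > 0) {
            LockSupport.parkNanos(remaining);
        }
        framesSent++;
    }

    /** Advances the counter without sleeping, for callers that pace by other means. */
    void markSent() {
        framesSent++;
    }
}
```

A sender loop would create `new FramePacer(System.nanoTime(), 20)`, then call `awaitNextFrame()` before each send. If one iteration runs late, the next deadline is still measured from the start, so the stream catches up instead of sliding.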

In VoiceBridge, the RTP correctness layer is explicit; the source is in the repository linked above.

Failure #3: NAT and Firewall Reality Destroys “One UDP Port” Designs

Real deployments involve NAT, stateful firewalls, and asymmetric routing. Asterisk might send RTP from a different source port than you expect, especially through NAT. Many “AI voice” solutions hardcode RTP endpoints:

  • “Send audio to the IP/port from the ARI response.”
  • “Bind to a UDP port and hope the packets come back.”

Under symmetric NAT or strict firewall policy, the inbound RTP source may differ from the advertised tuple. This yields classic symptoms:

  • you can hear caller audio, but the caller cannot hear the bot
  • or audio works for 10–30 seconds then goes silent (NAT mapping changes)
  • or duplex works on LAN but fails on WAN

VoiceBridge explicitly handles this by learning and locking the symmetric RTP endpoint.

This is the difference between “works in a demo” and “works behind real routers”.
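
The general idea of learning and locking an endpoint can be sketched as follows. This is a simplified illustration, not VoiceBridge's implementation: `SymmetricRtpLatch` is a hypothetical name, and the sketch locks onto the first source it sees without validating SSRC or payload type.

```java
import java.net.InetSocketAddress;
import java.util.concurrent.atomic.AtomicReference;

/** Symmetric RTP sketch: send to where packets actually come from, not where signaling said. */
final class SymmetricRtpLatch {
    private final InetSocketAddress advertised; // tuple from signaling, used until media arrives
    private final AtomicReference<InetSocketAddress> learned = new AtomicReference<>();

    SymmetricRtpLatch(InetSocketAddress advertised) {
        this.advertised = advertised;
    }

    /**
     * Call for every inbound RTP packet with its UDP source address.
     * Returns true if the packet comes from the locked endpoint and should be processed.
     */
    boolean onInbound(InetSocketAddress source) {
        if (learned.compareAndSet(null, source)) {
            return true;                         // first packet locks the endpoint
        }
        return source.equals(learned.get());     // later packets must match the lock
    }

    /** Destination for outbound RTP: the learned source if known, otherwise the advertised tuple. */
    InetSocketAddress outboundTarget() {
        InetSocketAddress current = learned.get();
        return current != null ? current : advertised;
    }
}
```

A strict lock like this drops media if the NAT mapping changes mid-call, so a production design would also need a re-learning policy and some validation of the new source before switching. Those parts are left out here.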

Failure #4: Incorrect ARI Media Graph (Bridge/Snoop/ExternalMedia Misuse)

Full duplex inside Asterisk is not “attach ExternalMedia and done”. The Asterisk media graph matters:

  • which channels are in which bridge
  • whether the bridge is mixing vs holding
  • whether you need snoop channels for read-only capture
  • how you avoid echo and talk/listen collisions

A common mistake is building a single bridge and assuming it will behave like a duplex media router. In practice, the most stable pattern is usually:

  • one leg dedicated to capture (caller → AI)
  • one leg dedicated to injection (AI → caller)
  • explicit lifecycle control during call start/stop

VoiceBridge implements this ARI orchestration as an explicit module (see the repository linked above).

If your ARI media graph is wrong, you will see “half duplex” behavior, echo-like artifacts, or silent legs.
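
For orientation, the sketch below builds the request URIs for the ARI REST operations this section refers to: creating a mixing bridge, creating an ExternalMedia channel, adding a channel to a bridge, and snooping on a channel. All four are sent as HTTP POST. The base URL, application name, and IDs are placeholders, and `AriRequests` is a hypothetical helper that only constructs URIs; it sends nothing and is not taken from VoiceBridge.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.StringJoiner;

/** Builds URIs for a few ARI REST calls (all POST). Sketch only: no HTTP client, no auth. */
final class AriRequests {
    private final String baseUrl; // assumed form: http://<asterisk-host>:8088/ari

    AriRequests(String baseUrl) {
        this.baseUrl = baseUrl;
    }

    /** POST /bridges?type=mixing */
    URI createMixingBridge() {
        return uri("/bridges", "type", "mixing");
    }

    /** POST /channels/externalMedia: Asterisk streams media to and from external_host. */
    URI externalMedia(String app, String host, int port, String format) {
        return uri("/channels/externalMedia",
                "app", app, "external_host", host + ":" + port, "format", format);
    }

    /** POST /bridges/{bridgeId}/addChannel?channel={channelId} (IDs assumed URL-safe). */
    URI addChannel(String bridgeId, String channelId) {
        return uri("/bridges/" + bridgeId + "/addChannel", "channel", channelId);
    }

    /** POST /channels/{channelId}/snoop with spy=in: a read-only tap on audio coming from the channel. */
    URI snoopInbound(String channelId, String app) {
        return uri("/channels/" + channelId + "/snoop", "app", app, "spy", "in");
    }

    private URI uri(String path, String... keyValues) {
        StringJoiner query = new StringJoiner("&");
        for (int i = 0; i < keyValues.length; i += 2) {
            query.add(encode(keyValues[i]) + "=" + encode(keyValues[i + 1]));
        }
        return URI.create(baseUrl + path + "?" + query);
    }

    private static String encode(String value) {
        return URLEncoder.encode(value, StandardCharsets.UTF_8);
    }
}
```

Which channels end up in which bridge, and in what order they are created and destroyed, is the design decision this section is about. The URIs are only the vocabulary.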

Failure #5: Codec Handling and “Invisible” Transcoding Costs

Many teams start with “PCM16 everywhere” because it’s easiest for AI models. But the caller leg is usually telephony codecs:

  • PCMU/PCMA (G.711) for SIP trunks
  • Opus for WebRTC endpoints

The moment you mix:

  • PCMU inbound from Asterisk
  • PCM16 required by AI STT
  • PCM16 returned by AI TTS
  • PCMU required outbound back to Asterisk

…you are now doing real-time transcoding. Most “AI voice” scripts do it incorrectly or too slowly:

  • wrong sample rate assumptions
  • wrong frame boundaries (causes clicking / robotic sound)
  • buffering too much (latency explosion)
  • CPU spikes under concurrency

VoiceBridge has explicit codec and framing components.

If you do not treat codec conversion as a first-class engineering concern, duplex fails under load (or sounds broken).
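
As an illustration of what the PCMU ↔ PCM16 step involves, here is a sketch of the widely used G.711 µ-law conversion on whole frames. It is a generic textbook-style implementation, not VoiceBridge's codec code, and `G711Ulaw` is a hypothetical class name. It assumes both sides already run at 8 kHz; resampling from 16 kHz or 24 kHz model audio is a separate step and is not shown.

```java
/** G.711 µ-law <-> 16-bit linear PCM sketch. Assumes 8 kHz mono on both sides. */
final class G711Ulaw {
    private static final int BIAS = 0x84;   // 132, added before the segment search
    private static final int CLIP = 32635;  // largest magnitude that survives the bias

    private G711Ulaw() {}

    /** Encodes one 16-bit sample to one µ-law byte. */
    static byte encode(short pcm) {
        int sample = pcm;
        int sign = (sample >> 8) & 0x80;     // 0x80 for negative input
        if (sign != 0) sample = -sample;
        if (sample > CLIP) sample = CLIP;
        sample += BIAS;
        int exponent = 7;
        for (int mask = 0x4000; (sample & mask) == 0 && exponent > 0; mask >>= 1) {
            exponent--;                       // find the segment (position of the top set bit)
        }
        int mantissa = (sample >> (exponent + 3)) & 0x0F;
        return (byte) ~(sign | (exponent << 4) | mantissa); // µ-law bytes are stored inverted
    }

    /** Decodes one µ-law byte to a 16-bit sample. */
    static short decode(byte ulaw) {
        int value = ~ulaw & 0xFF;
        int sign = value & 0x80;
        int exponent = (value >> 4) & 0x07;
        int mantissa = value & 0x0F;
        int sample = (((mantissa << 3) + BIAS) << exponent) - BIAS;
        return (short) (sign != 0 ? -sample : sample);
    }

    /** Frame helpers: one byte per sample, so a 160-sample frame becomes 160 bytes. */
    static byte[] encodeFrame(short[] pcm) {
        byte[] out = new byte[pcm.length];
        for (int i = 0; i < pcm.length; i++) out[i] = encode(pcm[i]);
        return out;
    }

    static short[] decodeFrame(byte[] ulaw) {
        short[] out = new short[ulaw.length];
        for (int i = 0; i < ulaw.length; i++) out[i] = decode(ulaw[i]);
        return out;
    }
}
```

Converting frame by frame keeps boundaries aligned with the RTP packets: 160 samples in, 160 bytes out, every 20 ms. Feeding the encoder arbitrary buffer sizes is one way the clicking described above creeps in.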

Failure #6: No Real Barge-In (Interruptions) Control

Full duplex is not just “two audio directions”. It also means:

  • the bot may be speaking
  • the caller starts speaking
  • the bot must stop speaking instantly
  • and the AI must receive the caller speech without being contaminated by the bot audio

Most systems fail here because they:

  • play long TTS audio without a stop mechanism
  • buffer too much audio in advance
  • have no “truncate” semantics toward the AI streaming engine

VoiceBridge implements barge-in as a control loop.

Without this, you can’t deliver human-like conversational flow — you deliver “please wait for the bot to finish”.
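
A barge-in control loop can be sketched in a few lines. The version below is a deliberately simple stand-in, not VoiceBridge's controller: it uses an RMS energy gate instead of a real voice activity detector, and the class name, threshold, and frame count are assumptions chosen for illustration.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Barge-in sketch: sustained caller energy flushes queued bot audio. */
final class BargeInController {
    private final Deque<byte[]> outbound = new ArrayDeque<>(); // bot frames not yet sent
    private final double rmsThreshold;    // assumed energy gate; needs tuning per deployment
    private final int framesToTrigger;    // e.g. 3 frames of 20 ms = 60 ms of sustained speech
    private int consecutiveSpeechFrames;

    BargeInController(double rmsThreshold, int framesToTrigger) {
        this.rmsThreshold = rmsThreshold;
        this.framesToTrigger = framesToTrigger;
    }

    synchronized void enqueueBotFrame(byte[] frame) { outbound.addLast(frame); }

    /** Next bot frame for the RTP sender, or null if nothing is queued. */
    synchronized byte[] pollBotFrame() { return outbound.pollFirst(); }

    synchronized int queuedFrames() { return outbound.size(); }

    /**
     * Feed every inbound caller frame. Returns true when barge-in fires; at that point
     * queued bot audio has been dropped and the caller of this method should also tell
     * the AI engine to cancel or truncate its current response.
     */
    synchronized boolean onCallerFrame(short[] pcm) {
        if (rms(pcm) >= rmsThreshold) {
            consecutiveSpeechFrames++;
        } else {
            consecutiveSpeechFrames = 0;
        }
        if (consecutiveSpeechFrames >= framesToTrigger && !outbound.isEmpty()) {
            outbound.clear();
            consecutiveSpeechFrames = 0;
            return true;
        }
        return false;
    }

    static double rms(short[] pcm) {
        if (pcm.length == 0) return 0.0;
        double sum = 0.0;
        for (short s : pcm) sum += (double) s * s;
        return Math.sqrt(sum / pcm.length);
    }
}
```

The stop latency is bounded by the detection window plus whatever has already left the queue, which is why a short outbound queue matters as much as the detector. If bot audio leaks back into the capture leg, an energy gate like this will trigger on the bot's own voice, so the media graph has to keep the two directions separate.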

Failure #7: No Session State Model (So Cleanup and Race Conditions Kill You)

Media systems are state machines. Calls end abruptly. Channels hang up. Bridges get destroyed. If your integration is “some scripts + UDP”, then over time you accumulate:

  • orphaned UDP sockets
  • port leaks (RTP ports never returned)
  • dangling ARI channels/bridges
  • threads blocked on I/O

Under concurrency, race conditions become common:

  • AI returns TTS after call already ended
  • you send RTP to a port now assigned to a different call
  • ARI sends an event after your script “thinks” it’s done

VoiceBridge treats the call as a first-class session object.

Failure #8: “AI Integration” That Ignores Real-Time Constraints

Many AI integrations treat speech like files:

  • buffer 5–10 seconds, then send for STT
  • wait for full LLM response
  • generate entire TTS audio first
  • then play back

That approach guarantees unnatural conversation. Full duplex requires streaming:

  • stream caller audio continuously to STT/AI
  • receive partial response quickly
  • start speaking while AI is still generating
  • be able to stop instantly on interruption
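
"Start speaking while AI is still generating" usually means cutting the partial text into speakable pieces as it arrives. The sketch below shows one naive way to do that, splitting on sentence punctuation. `SpeechChunker` is a hypothetical helper, not part of VoiceBridge or of any AI vendor's API, and it will mis-split text such as decimals and abbreviations.

```java
import java.util.ArrayList;
import java.util.List;

/** Splits a stream of partial text into sentences so TTS can start before the reply is complete. */
final class SpeechChunker {
    private final StringBuilder pending = new StringBuilder();

    /** Feed one partial-text delta; returns the sentences that are now complete. */
    List<String> onDelta(String delta) {
        List<String> ready = new ArrayList<>();
        for (int i = 0; i < delta.length(); i++) {
            char c = delta.charAt(i);
            pending.append(c);
            if (c == '.' || c == '!' || c == '?') {   // naive sentence boundary
                String sentence = pending.toString().trim();
                if (!sentence.isEmpty()) ready.add(sentence);
                pending.setLength(0);
            }
        }
        return ready;
    }

    /** Call when the model signals end of response; returns the trailing fragment, or null. */
    String flush() {
        String rest = pending.toString().trim();
        pending.setLength(0);
        return rest.isEmpty() ? null : rest;
    }

    /** Call on barge-in: discard text that has not been handed to TTS yet. */
    void cancel() {
        pending.setLength(0);
    }
}
```

With this shape, the first sentence can be synthesized and sent while later tokens are still arriving, and `cancel()` gives the interruption path something concrete to call.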

VoiceBridge’s real-time AI layer is explicitly separated and implemented as streaming clients.

What Actually Works: The Minimum Requirements for True Duplex

If you want full duplex that survives production, your solution needs all of the following:

  • Correct RTP packetization (headers + pacing + timestamps)
  • Two-leg media model where capture and injection are controlled reliably
  • Symmetric RTP endpoint learning for NAT correctness
  • Codec discipline (real conversion, not assumptions)
  • Streaming AI integration (not file-based turns)
  • Barge-in controller (detect speech, truncate bot audio, resume listening)
  • Session lifecycle state machine (cleanup, retries, failure handling)
  • Scalability primitives (port allocation, concurrency, backpressure)

This is why “AI calling” isn’t a single feature — it’s a system.

Why VoiceBridge Doesn’t Fall Into These Traps

VoiceBridge was built specifically around the failure modes above. It is not “ARI + UDP”. It is a real media bridge with explicit modules for RTP correctness, NAT safety, barge-in, codec handling, and event-driven ARI session control.

If you want to explore the implementation, start with the project repository linked at the top of this article.

Practical Checklist: How to Spot a “Fake Duplex” Solution

If someone claims they have “full duplex AI voice for Asterisk”, ask these questions:

  • Can the caller interrupt the bot, and does the bot stop within 200 ms?
  • Do they send outbound RTP with correct timestamps and stable pacing, or bursts?
  • Do they handle symmetric NAT by learning the inbound RTP source endpoint?
  • Do they support real concurrency with clean port allocation and cleanup?
  • Do they have a session state model, or is it “scripts and sockets”?
  • Can they show Wireshark traces proving both RTP directions are continuous and stable?

If the answers are vague, the solution is not production duplex.

Conclusion

Most “AI voice for Asterisk” attempts fail at full duplex because they solve the AI part and ignore the telecom part. Real duplex requires engineering RTP correctness, NAT survival, bridge topology, codec conversion, barge-in control, and lifecycle state — all at once, under load.

The open-source path that addresses these realities directly is MYLINEHUB VoiceBridge. Explore the project here: https://github.com/mylinehub/omnichannel-crm/tree/main/mylinehub-voicebridge.
