VoiceBridge Architecture Deep Dive: Asterisk → RTP → AI → RTP → Asterisk

MYLINEHUB Team • 2026-02-08 • 14 min

A deep dive into the VoiceBridge media pipeline: capture RTP from Asterisk, process with AI, and inject RTP back—fully duplex and production-safe.

This article is a true “under the hood” explanation of how MYLINEHUB VoiceBridge moves audio end-to-end in production: Asterisk/FreePBX captures caller audio, VoiceBridge streams it to an AI bot, then injects AI speech back into the same live call—while preserving real full-duplex behavior (barge-in, talk-over, cut-through) instead of turn-based IVR feel.

The source of truth for this write-up is the VoiceBridge code in the GitHub repo mylinehub-voicebridge. For the high-level overview, see MYLINEHUB VoiceBridge Architecture.

The core problem: Asterisk is not an AI media engine

Asterisk can bridge channels, record calls, run IVRs, and expose ARI for control—but it does not natively provide a clean, bidirectional, low-latency “raw audio duplex API” suitable for AI real-time voice. Most demo integrations fail because they:

Lose one direction of audio under NAT/firewall conditions.
Send RTP back with wrong timing (timestamp drift, bad pacing, payload mismatch).
Cannot handle talk-over (barge-in) cleanly because audio injection is not managed as a real-time stream.
Do not pin and control two independent RTP legs (caller→AI and AI→caller) as first-class streams.

VoiceBridge solves this by building a deliberate media graph in ARI and running a strict RTP engine beside Asterisk.

High-level architecture and the real media pipeline

The pipeline is not “one ExternalMedia channel” and it’s not “just record + playback”. VoiceBridge creates a proper duplex graph:

Inbound capture leg: caller audio is tapped and sent out to VoiceBridge RTP input.
Outbound injection leg: VoiceBridge sends RTP back into Asterisk which is bridged to the caller.

The important detail: VoiceBridge treats inbound and outbound as two separate RTP streams with their own ports, timing, endpoints, and health signals. That’s the foundation for stable duplex.

The ARI media graph: two bridges, snoop inbound, two ExternalMedia channels

The graph is assembled in Java in ari/impl/AriBridgeImpl.java, and the ARI REST calls are executed by ari/impl/ExternalMediaManagerImpl.java.

The pattern used is intentionally “dual-bridge duplex”:

Talk Bridge (mixing): caller channel is bridged with ExternalMedia OUT (the channel that will play AI audio back to the caller).
Tap Bridge (mixing): caller audio is captured via Snoop Inbound and bridged with ExternalMedia IN (the channel through which Asterisk sends the captured caller audio out to VoiceBridge).

You can see this explicitly in AriBridgeImpl where it creates two bridges (both “mixing”), creates one snoop inbound channel, and two ExternalMedia channels—then adds them to the appropriate bridges. This is the mechanical core of “Asterisk → RTP → AI → RTP → Asterisk”.
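To make the graph concrete, here is a minimal sketch of the ARI REST calls such a graph needs, in order. It only builds the request lines and sends nothing. The class name, the ID scheme, the app name, and the ulaw format are assumptions for illustration and are not taken from AriBridgeImpl; values are also left un-encoded for readability.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: lists the ARI REST calls for a dual-bridge graph in order.
// Nothing is sent; all IDs, the app name, and the ulaw format are assumptions.
final class DualBridgePlan {

    static List<String> build(String app, String callerChannel, String vbHost,
                              int inboundPort, int outboundPort) {
        String talkBridge = "talk-" + callerChannel;   // hypothetical ID scheme
        String tapBridge = "tap-" + callerChannel;
        String snoop = "snoop-" + callerChannel;
        String emIn = "em-in-" + callerChannel;
        String emOut = "em-out-" + callerChannel;

        List<String> calls = new ArrayList<>();
        // 1. Two mixing bridges.
        calls.add("POST /bridges/" + talkBridge + "?type=mixing");
        calls.add("POST /bridges/" + tapBridge + "?type=mixing");
        // 2. Snoop channel that spies on audio coming from the caller.
        calls.add("POST /channels/" + callerChannel + "/snoop/" + snoop
                + "?spy=in&app=" + app);
        // 3. Two ExternalMedia channels, each pointed at its own VoiceBridge port.
        calls.add(externalMedia(emIn, app, vbHost, inboundPort));
        calls.add(externalMedia(emOut, app, vbHost, outboundPort));
        // 4. Wire the graph: caller + OUT on the talk bridge, snoop + IN on the tap bridge.
        calls.add("POST /bridges/" + talkBridge + "/addChannel?channel="
                + callerChannel + "," + emOut);
        calls.add("POST /bridges/" + tapBridge + "/addChannel?channel="
                + snoop + "," + emIn);
        return calls;
    }

    private static String externalMedia(String id, String app, String host, int port) {
        return "POST /channels/externalMedia?channelId=" + id + "&app=" + app
                + "&external_host=" + host + ":" + port + "&format=ulaw";
    }
}
```

Calling DualBridgePlan.build("voicebridge", "1234.5", "10.0.0.5", 40000, 40002) yields seven request lines. A real client would URL-encode the values, authenticate, and check each response before issuing the next call.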

Why two ExternalMedia channels instead of one?

In real deployments, trying to do everything through one ExternalMedia channel tends to collapse into one-way audio or timing chaos because:

Asterisk’s media directionality and RTP peer learning behave differently on each leg.
Under NAT, the “correct” remote RTP endpoint may only become known after packets flow.
Inbound capture and outbound injection have different pacing responsibilities.

VoiceBridge uses separate ports (inbound port vs outbound port) and treats them as separate engines. The port allocation is handled via: rtp/RtpPortAllocator.java, and per-call state is attached to the call session object in: session/CallSession.java.
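As a rough illustration of per-call port allocation, the sketch below hands out one inbound and one outbound UDP port per call from a fixed range and takes them back on release. The class name, the even-port convention, and the block-of-four layout are assumptions for this example; the actual RtpPortAllocator may work differently.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical allocator: each call gets two even ports (inbound, outbound).
final class PortPairAllocator {

    static final class PortPair {
        final int inbound;
        final int outbound;

        PortPair(int inbound, int outbound) {
            this.inbound = inbound;
            this.outbound = outbound;
        }
    }

    // Holds the inbound port of every free pair; outbound is always inbound + 2.
    private final Deque<Integer> free = new ArrayDeque<>();

    PortPairAllocator(int firstPort, int lastPort) {
        int start = (firstPort % 2 == 0) ? firstPort : firstPort + 1;
        for (int p = start; p + 2 <= lastPort; p += 4) {
            free.addLast(p);
        }
    }

    synchronized PortPair acquire() {
        Integer base = free.pollFirst();
        if (base == null) {
            throw new IllegalStateException("RTP port range exhausted");
        }
        return new PortPair(base, base + 2);
    }

    // Must be called exactly once per acquired pair, typically at call teardown.
    synchronized void release(PortPair pair) {
        free.addLast(pair.inbound);
    }
}
```

With a range of 40000 to 40007 this yields two pairs, (40000, 40002) and (40004, 40006). Failing fast on exhaustion makes capacity limits visible instead of silently reusing a port that is still in use.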

RTP engine fundamentals: packetizer, timestamps, payload types

Sending audio back to Asterisk is not “send UDP bytes”. It must be valid RTP with correct sequence numbers, timestamps, SSRC, and payload type, plus disciplined pacing (20ms frames, stable clock increments, no bursty scheduling).

VoiceBridge’s RTP framing logic lives in: rtp/RtpPacketizer.java. This is where the system ensures it emits correct RTP packets rather than “best effort UDP”.
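For reference, a minimal packetizer for 20ms PCMU frames could look like the sketch below. It writes the fixed 12-byte RTP header (version 2, no padding, extension, or CSRCs) and advances the sequence number by 1 and the timestamp by 160 samples per frame. The class name and constructor are hypothetical, and this is not the code in rtp/RtpPacketizer.java.

```java
import java.nio.ByteBuffer;

// Illustrative PCMU packetizer: 8000 Hz clock, 20 ms frames, fixed 12-byte header.
final class PcmuPacketizer {
    static final int PAYLOAD_TYPE_PCMU = 0;    // static payload type for PCMU
    static final int SAMPLES_PER_FRAME = 160;  // 20 ms at 8000 Hz, 1 byte per sample

    private final int ssrc;
    private int sequence;     // 16-bit, wraps at 65536
    private long timestamp;   // 32-bit, wraps at 2^32

    PcmuPacketizer(int ssrc, int initialSequence, long initialTimestamp) {
        this.ssrc = ssrc;
        this.sequence = initialSequence & 0xFFFF;
        this.timestamp = initialTimestamp & 0xFFFFFFFFL;
    }

    byte[] next(byte[] payload) {
        if (payload.length != SAMPLES_PER_FRAME) {
            throw new IllegalArgumentException("expected 160 bytes of PCMU");
        }
        ByteBuffer buf = ByteBuffer.allocate(12 + payload.length); // big-endian by default
        buf.put((byte) 0x80);               // V=2, P=0, X=0, CC=0
        buf.put((byte) PAYLOAD_TYPE_PCMU);  // M=0, PT=0
        buf.putShort((short) sequence);
        buf.putInt((int) timestamp);
        buf.putInt(ssrc);
        buf.put(payload);

        sequence = (sequence + 1) & 0xFFFF;
        timestamp = (timestamp + SAMPLES_PER_FRAME) & 0xFFFFFFFFL;
        return buf.array();
    }
}
```

The timestamp advances by sample count, not by wall-clock time, so it stays consistent even when a packet is sent slightly late.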

Production note: even if your audio bytes are correct (PCMU/PCM16), RTP that is badly timed will sound like: choppy speech, robotic jitter, random dropouts, or “it plays for 2 seconds then stops”. That’s almost always pacing and timestamp discipline.
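One common way to keep a stable cadence is to compute every send deadline from a fixed start time rather than sleeping a fixed 20ms after each send, so scheduling jitter does not accumulate into drift. The sketch below shows that idea only; the class and method names are hypothetical and do not describe the VoiceBridge scheduler.

```java
// Illustrative drift-free pacer: deadline n is always start + n * frame duration.
final class FramePacer {
    private final long startNanos;
    private final long frameNanos;
    private long framesSent;

    FramePacer(long startNanos, int frameMillis) {
        this.startNanos = startNanos;
        this.frameNanos = frameMillis * 1_000_000L;
    }

    // Nanoseconds to wait before the next frame is due; 0 if it is already late.
    long waitNanos(long nowNanos) {
        long deadline = startNanos + framesSent * frameNanos;
        return Math.max(0L, deadline - nowNanos);
    }

    // Call once after each frame has been written to the socket.
    void markSent() {
        framesSent++;
    }
}
```

In a send loop, the caller would pass System.nanoTime() as the clock and park for the returned duration (for example with LockSupport.parkNanos) before sending. A frame that goes out late shortens the next wait instead of pushing every later frame back.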

NAT-safe behavior: symmetric RTP learning (why one-way audio happens)

One of the most common failure modes in real installations is: Asterisk sends RTP to VoiceBridge, but does not accept RTP back (or accepts it intermittently). The root cause is typically NAT and endpoint mismatch: the RTP “peer” you think you should send to is not the endpoint Asterisk ends up using once the call is live.

VoiceBridge addresses this with symmetric RTP endpoint learning: rtp/RtpSymmetricEndpoint.java. Conceptually:

Listen on the allocated UDP port.
Observe the actual source IP:port of inbound RTP packets.
“Lock” the learned remote endpoint for outbound RTP replies (or keep it adaptive if your policy requires).

This single behavior is often the difference between “demo works on LAN” and “production works across NAT and firewalls”.
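The three steps above can be sketched as a small state holder that records the source address of inbound packets and exposes it as the send target. The class name and the lockOnFirst flag are assumptions for illustration; RtpSymmetricEndpoint itself may apply a different policy.

```java
import java.net.InetSocketAddress;
import java.util.Optional;

// Illustrative symmetric-RTP peer learning, independent of any socket code.
final class LearnedPeer {
    private final boolean lockOnFirst;
    private volatile InetSocketAddress peer;

    LearnedPeer(boolean lockOnFirst) {
        this.lockOnFirst = lockOnFirst;
    }

    // Call for every inbound RTP datagram with the address it actually came from.
    void onInbound(InetSocketAddress source) {
        if (peer == null || !lockOnFirst) {
            peer = source;
        }
    }

    // Empty until at least one inbound packet has been seen.
    Optional<InetSocketAddress> sendTarget() {
        return Optional.ofNullable(peer);
    }
}
```

In a receive loop built on DatagramSocket, the source would come from DatagramPacket.getSocketAddress(). Locking on the first packet resists spoofed traffic changing the target, while the adaptive mode follows a NAT rebinding mid-call; which one is right depends on the network policy.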

How VoiceBridge discovers the Asterisk RTP peer for ExternalMedia

With ARI ExternalMedia, Asterisk exposes variables like UNICASTRTP_LOCAL_ADDRESS and UNICASTRTP_LOCAL_PORT (and related fields) once the channel is live. VoiceBridge actively fetches this peer information after creating the ExternalMedia channel.

That “who exactly is Asterisk on the RTP side” discovery lives in the ExternalMedia manager layer (ExternalMediaManagerImpl.java) and is used during graph readiness in AriBridgeImpl.java.

Why it matters: in complex networks, “the host you called ARI with” is not always the same interface Asterisk binds for RTP. If you send RTP to the wrong interface or wrong port, you get silence (classic one-way audio symptom).
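As a sketch of what that lookup involves, ARI can return a channel variable through GET /channels/{channelId}/variable, answering with a small JSON object that carries a value field. The helper below builds that path and extracts the value. The class and method names are hypothetical, and the regex parsing is only there to keep the example dependency-free; a real client would use a JSON library.

```java
import java.net.InetSocketAddress;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative helper for reading UNICASTRTP_LOCAL_* variables via ARI.
final class AriVariable {
    private static final Pattern VALUE =
            Pattern.compile("\"value\"\\s*:\\s*\"([^\"]*)\"");

    // Request path for one channel variable (values are not URL-encoded here).
    static String path(String channelId, String variable) {
        return "/channels/" + channelId + "/variable?variable=" + variable;
    }

    // Pulls the "value" field out of a response body such as {"value":"10.0.0.5"}.
    static String parseValue(String json) {
        Matcher m = VALUE.matcher(json);
        if (!m.find()) {
            throw new IllegalArgumentException("no value field in: " + json);
        }
        return m.group(1);
    }

    // Combines the address and port responses into the RTP peer to send to.
    static InetSocketAddress peer(String addressJson, String portJson) {
        String host = parseValue(addressJson);
        int port = Integer.parseInt(parseValue(portJson));
        return InetSocketAddress.createUnresolved(host, port);
    }
}
```

The two lookups would use UNICASTRTP_LOCAL_ADDRESS and UNICASTRTP_LOCAL_PORT on the ExternalMedia channel once it is up.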

Session management and lifecycle: call state, cleanup, and shutdown safety

A production duplex system must be able to stop safely: close sockets, flush audio queues, finalize recordings, and tear down ARI bridges—even if the call drops unexpectedly.

VoiceBridge tracks everything per call using: session/CallSession.java and orchestrates call lifecycle with: session/CallSessionManager.java.

On StasisEnd the system shuts down:

Playout scheduling (queue/PlayoutScheduler.java)
RTP endpoints (dual: inbound + outbound) via RtpSymmetricEndpoint
DSP resources (echo/VAD helpers) via services like service/DspService.java
Recording writers (if enabled) such as recording/StereoWavFileWriter.java

This lifecycle discipline is one of the biggest differences between “sample code” and a production system.
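One way to get that discipline is to register every per-call resource as it is created and close them all in a fixed order, making sure a failure in one step never skips the rest. The sketch below shows the pattern under that assumption; the class name is hypothetical and it is not how CallSessionManager is necessarily written.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative teardown helper: closes every registered resource, collects failures.
final class CallTeardown {
    private final List<String> names = new ArrayList<>();
    private final List<AutoCloseable> resources = new ArrayList<>();

    synchronized CallTeardown register(String name, AutoCloseable resource) {
        names.add(name);
        resources.add(resource);
        return this;
    }

    // Closes in registration order. Safe to call twice: the second call is a no-op.
    synchronized List<String> closeAll() {
        List<String> failures = new ArrayList<>();
        for (int i = 0; i < resources.size(); i++) {
            try {
                resources.get(i).close();
            } catch (Exception e) {
                failures.add(names.get(i) + ": " + e.getMessage());
            }
        }
        names.clear();
        resources.clear();
        return failures;
    }
}
```

A StasisEnd handler would call closeAll() once and log whatever comes back. Because the call is idempotent, it is also safe to trigger it from a hangup event and from an error path.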

AI streaming integration: realtime client, truncation, and cut-through

VoiceBridge is designed to support real-time AI sessions (streaming STT/LLM/TTS) while preserving duplex. The AI side is abstracted behind factories and clients like:

Realtime AI client implementation: ai/impl/RealtimeAiClientImpl.java
Truncation/cut-through manager (barge-in control): ai/impl/OpenAiRealtimeTruncateManager.java

“Truncation” here is the crucial real-time behavior: when the caller starts speaking while the bot is talking, the system must stop (or cut down) bot audio output immediately. If you don’t do this, duplex exists technically but the user experience feels broken (the caller must “wait for the bot to finish talking”).

VoiceBridge does not treat barge-in as a UI feature—it treats it as a first-class media control requirement.
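On the media side, barge-in boils down to two actions when caller speech is detected: drop all bot audio that has not been played yet, and report how much was actually played so the AI side can truncate its response to match. The sketch below illustrates that with a hypothetical BotPlayout class and 20ms frames; it is not the logic in OpenAiRealtimeTruncateManager.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative bot playout queue with barge-in support.
final class BotPlayout {
    static final int FRAME_MS = 20;

    private final Deque<byte[]> pending = new ArrayDeque<>();
    private int framesPlayed;   // frames played since the last barge-in

    synchronized void enqueue(byte[] frame) {
        pending.addLast(frame);
    }

    // Called by the pacer every 20 ms; returns null when there is nothing to play.
    synchronized byte[] nextFrame() {
        byte[] frame = pending.pollFirst();
        if (frame != null) {
            framesPlayed++;
        }
        return frame;
    }

    // Called when caller speech is detected while the bot is talking.
    // Drops unplayed audio and returns the milliseconds the caller actually heard.
    synchronized int bargeIn() {
        pending.clear();
        int playedMs = framesPlayed * FRAME_MS;
        framesPlayed = 0;
        return playedMs;
    }

    synchronized int pendingFrames() {
        return pending.size();
    }
}
```

The returned duration is the kind of value a provider-side truncation request needs, so the model's view of what it said matches what the caller heard.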

DB-driven runtime configuration: why production needs it

In production, you cannot hardcode RTP settings, AI configuration, recording mode, queue behavior, or retry thresholds. VoiceBridge loads these at runtime from the database, with the central entity: models/StasisAppConfig.java.

This model includes fields for:

ARI connection parameters (host/port/app/user/pass)
Codec + sample rate expectations
Recording enablement and storage mode
Queue watermarks and pacing thresholds
Resilience settings (retry/circuit-breaker thresholds)
AI provider selection and realtime parameters

Fetching config per call is done through a repository/service pattern such as: repository/StasisAppConfigRepository.java and service/StasisAppConfigService.java.

What makes the pipeline “production-safe” (not just working once)

“Working” in a lab is easy. Production-safe duplex requires guardrails:

RTP pacing discipline (no burst writes, stable 20ms cadence).
NAT endpoint learning (send replies to the real RTP peer).
Codec correctness (payload type and actual audio format match end-to-end).
Queue control (prevent bot playout queue from growing until conversation becomes delayed).
Clean teardown (StasisEnd closes endpoints, stops schedulers, finalizes recordings).

VoiceBridge encodes these behaviors in the actual components we referenced above rather than relying on fragile dialplan tricks.
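To illustrate the queue-control guardrail, here is a minimal watermark queue. When the depth exceeds a high watermark it discards the oldest frames until it is back at a low watermark, which bounds how far bot audio can lag behind the conversation. Dropping the oldest frames is only one possible policy, and the class and parameter names are assumptions rather than VoiceBridge's actual queue implementation.

```java
import java.util.ArrayDeque;

// Illustrative bounded playout queue with high/low watermarks (counted in frames).
final class WatermarkQueue {
    private final ArrayDeque<byte[]> frames = new ArrayDeque<>();
    private final int highWatermark;
    private final int lowWatermark;

    WatermarkQueue(int highWatermark, int lowWatermark) {
        if (lowWatermark < 0 || lowWatermark >= highWatermark) {
            throw new IllegalArgumentException("require 0 <= low < high");
        }
        this.highWatermark = highWatermark;
        this.lowWatermark = lowWatermark;
    }

    // Adds a frame and returns how many old frames were dropped to stay bounded.
    synchronized int offer(byte[] frame) {
        frames.addLast(frame);
        int dropped = 0;
        if (frames.size() > highWatermark) {
            while (frames.size() > lowWatermark) {
                frames.pollFirst();
                dropped++;
            }
        }
        return dropped;
    }

    // Returns the oldest remaining frame, or null if the queue is empty.
    synchronized byte[] poll() {
        return frames.pollFirst();
    }

    synchronized int size() {
        return frames.size();
    }
}
```

With 20ms frames, a high watermark of 50 caps the backlog at about one second of audio. A non-zero drop count is worth logging, since it usually means the AI side is producing audio faster than real time.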

Deployment notes: ports, firewall, and why you must plan RTP explicitly

Duplex means at least two RTP ports per call (often more if you add monitoring/recording legs). VoiceBridge allocates and uses RTP ports explicitly via: RtpPortAllocator.

At minimum, you must ensure:

Asterisk can reach VoiceBridge RTP port range (UDP).
VoiceBridge can send RTP back to Asterisk RTP peer (UDP).
ARI HTTP/WebSocket access is allowed only from trusted networks (not public internet).

The service-level bind is configured in: src/main/resources/application.properties (example: rtp.bind.port).

Cross-link to related VoiceBridge architecture

MYLINEHUB VoiceBridge Architecture (big picture)

If you want to validate your deployment, read that overview first, then come back to this deep dive to map symptoms to architecture.

Summary: the “Asterisk → RTP → AI → RTP → Asterisk” contract

VoiceBridge is not “an AI bot inside Asterisk”. It is a dedicated duplex RTP engine that:

Builds a correct ARI media graph (dual bridges + snoop + dual ExternalMedia).
Runs strict RTP packetization and timing to inject audio reliably.
Uses symmetric endpoint learning so NAT does not break duplex audio.
Manages AI streaming and truncation so talk-over behaves like real conversation.
Keeps everything configurable through DB-driven runtime settings.

That’s the practical difference between “a demo” and “production full-duplex voice”.
