SIP vs SDP vs RTP: What Each Protocol Actually Does in a Call
Confused between SIP, SDP, and RTP? Learn what each one does, how they work together during a call, and why understanding the difference helps in VoIP design and troubleshooting.
🌐 Start with HTTP so the difference becomes obvious
Think about how a normal website works.
A browser sends an HTTP request to a server. The server sends an HTTP response back. That pattern is mostly about asking for resources and receiving data.
But a phone call or real-time voice session needs something more structured than “give me data”. A call needs:
• a way to start and control the session
• a way to describe how media should work
• a way to carry the actual media in real time
That is exactly where these three fit:
SIP → starts / manages / ends the call
SDP → describes how media should be exchanged
RTP → carries the actual audio/video packets
⚡ One-line difference before we go deeper
SIP is about the session conversation.
SDP is about the media agreement.
RTP is about the real-time media stream itself.
📞 What SIP actually does
SIP = Session Initiation Protocol.
SIP is a signaling protocol. Its job is not to carry the actual voice. Its job is to control the session.
SIP typically handles:
• finding or reaching the destination
• ringing the destination
• accepting or rejecting the call
• creating the dialog/session state
• putting the call on hold
• transferring the call
• ending the call
Common SIP methods:
| Method | Purpose | Simple meaning |
|---|---|---|
| INVITE | Start or modify a session | “Let’s begin a call.” |
| ACK | Confirm the final response to an INVITE | “I got your answer.” |
| BYE | Terminate session | “The call is over.” |
| CANCEL | Cancel a pending INVITE before it is answered | “Stop trying to connect.” |
| OPTIONS | Capability check | “What can you do?” |
| REGISTER | Register endpoint | “I’m here and reachable.” |
The key point is: SIP manages the conversation around the call, not the voice stream itself.
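To make the request/response idea concrete, here is a minimal sketch that classifies the first line of a SIP message. The function name `classify_start_line` and the simplified regular expressions are assumptions for illustration; a real SIP stack parses the full grammar.

```python
import re

# Request line:  METHOD SP Request-URI SP SIP-Version
REQUEST_LINE = re.compile(r"^([A-Z]+) (\S+) (SIP/\d\.\d)$")
# Status line:   SIP-Version SP Status-Code SP Reason-Phrase
STATUS_LINE = re.compile(r"^(SIP/\d\.\d) (\d{3}) (.*)$")


def classify_start_line(line: str) -> dict:
    """Classify the first line of a SIP message as a request or a response."""
    line = line.strip()
    match = REQUEST_LINE.match(line)
    if match:
        return {"kind": "request", "method": match.group(1), "uri": match.group(2)}
    match = STATUS_LINE.match(line)
    if match:
        return {"kind": "response", "status": int(match.group(2)),
                "reason": match.group(3)}
    raise ValueError(f"not a SIP start line: {line!r}")


print(classify_start_line("INVITE sip:1002@pbx.example.com SIP/2.0"))
print(classify_start_line("SIP/2.0 180 Ringing"))
```

Nothing in either line mentions audio. The start line only says what the signaling message is trying to do.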
📄 What SDP actually does
SDP = Session Description Protocol.
In practice, think of SDP as: a text-based description of the media rules for the session.
SDP usually tells the other side:
• what media type exists: audio, video, application
• which codecs are supported
• which IP address to send media to
• which port to send media to
• whether the stream is sendrecv, sendonly, recvonly, or inactive
• in advanced systems, ICE candidates, DTLS fingerprints, BUNDLE grouping, and rtcp-mux details
Important: SDP often travels inside SIP, but SDP is not SIP itself. It is the media description payload that SIP may carry.
Easy memory line: SIP says there will be a call. SDP explains how the media should work inside that call.
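As a rough illustration of how little machinery is needed to read those fields, the sketch below pulls the address, port, codecs, and direction out of an SDP body. The helper name `summarize_sdp` is hypothetical, and it assumes a single media section with a session-level `c=` line; real SDP can contain several `m=` sections with their own attributes.

```python
def summarize_sdp(sdp: str) -> dict:
    """Extract address, port, codecs and direction from a one-stream SDP body."""
    summary = {"address": None, "media": None, "port": None,
               "payload_types": [], "codecs": {}, "direction": "sendrecv"}
    for raw in sdp.splitlines():
        line = raw.strip()
        if line.startswith("c="):
            # c=<nettype> <addrtype> <connection-address>
            summary["address"] = line[2:].split()[2]
        elif line.startswith("m="):
            # m=<media> <port> <proto> <fmt> ...
            media, port, _proto, *fmts = line[2:].split()
            summary["media"] = media
            summary["port"] = int(port)
            summary["payload_types"] = [int(f) for f in fmts]
        elif line.startswith("a=rtpmap:"):
            # a=rtpmap:<payload type> <encoding name>/<clock rate>
            pt, encoding = line[len("a=rtpmap:"):].split(" ", 1)
            summary["codecs"][int(pt)] = encoding
        elif line in ("a=sendrecv", "a=sendonly", "a=recvonly", "a=inactive"):
            summary["direction"] = line[2:]
    return summary


EXAMPLE_SDP = """v=0
o=- 3747 3747 IN IP4 192.168.1.10
s=VoIP Call
c=IN IP4 192.168.1.10
t=0 0
m=audio 49170 RTP/AVP 0 8 101
a=rtpmap:0 PCMU/8000
a=rtpmap:8 PCMA/8000
a=rtpmap:101 telephone-event/8000
a=sendrecv
"""

summary = summarize_sdp(EXAMPLE_SDP)
print(summary["address"], summary["port"], summary["direction"])
print(summary["codecs"])
```

The direction defaults to `sendrecv` because that is what a stream means when no direction attribute is present.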
🎙️ What RTP actually does
RTP = Real-time Transport Protocol.
RTP carries the actual media stream. If two people are talking, RTP is the protocol that usually transports the encoded audio in real time.
RTP packets usually include:
• payload type
• sequence number
• timestamp
• SSRC
• media payload data
So when someone says “the call is connected but no audio is coming”, SIP signaling has usually succeeded and an SDP exchange has completed, but either RTP is not reaching the other side or the SDP described the RTP path incorrectly.
🆚 Quick comparison table
| Item | Main job | Human analogy | Carries voice? | Typical example |
|---|---|---|---|---|
| SIP | Signal and manage the session | The call-control conversation | No | INVITE, 180 Ringing, 200 OK, BYE |
| SDP | Describe media rules | The agreement sheet | No | Codec list, IP, port, direction |
| RTP | Carry real-time media packets | The actual sound/video stream | Yes | Voice frames every 20 ms |
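The “voice frames every 20 ms” figure can be turned into concrete numbers. This is a small worked calculation, assuming PCMU (G.711) at 8000 Hz with a 20 ms packetization interval; other codecs and intervals give different values.

```python
SAMPLE_RATE_HZ = 8000   # PCMU (G.711 mu-law) clock rate
PACKET_MS = 20          # assumed packetization interval
BYTES_PER_SAMPLE = 1    # G.711 encodes each sample in 8 bits

samples_per_packet = SAMPLE_RATE_HZ * PACKET_MS // 1000
payload_bytes = samples_per_packet * BYTES_PER_SAMPLE
packets_per_second = 1000 // PACKET_MS
payload_bitrate_bps = payload_bytes * 8 * packets_per_second

print(samples_per_packet)    # 160 samples, so the RTP timestamp advances by 160 per packet
print(payload_bytes)         # 160 bytes of audio per packet
print(packets_per_second)    # 50 packets per second in each direction
print(payload_bitrate_bps)   # 64000 bits per second of payload, before headers
```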
📦 A real SIP message carrying SDP
This is the exact point where many people confuse SIP and SDP. Look closely: SIP is the outer message. SDP is the body inside it.
INVITE sip:1002@pbx.example.com SIP/2.0
Via: SIP/2.0/UDP 192.168.1.10:5060;branch=z9hG4bK-12345
From: "Alice" <sip:1001@pbx.example.com>;tag=abc123
To: <sip:1002@pbx.example.com>
Call-ID: 9f8e7d6c@example.com
CSeq: 1 INVITE
Contact: <sip:1001@192.168.1.10:5060>
Content-Type: application/sdp
Content-Length: 245

v=0
o=- 3747 3747 IN IP4 192.168.1.10
s=VoIP Call
c=IN IP4 192.168.1.10
t=0 0
m=audio 49170 RTP/AVP 0 8 101
a=rtpmap:0 PCMU/8000
a=rtpmap:8 PCMA/8000
a=rtpmap:101 telephone-event/8000
a=fmtp:101 0-16
a=sendrecv
INVITE / Via / From / To / Call-ID / CSeq = SIP
v=0 / c= / m=audio / a=rtpmap / a=sendrecv = SDP
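The boundary between the two is simply the first empty line of the message. Here is a minimal sketch of splitting on that boundary, using a hypothetical helper named `split_sip_message`. It assumes each header appears once and ignores header folding and compact header names.

```python
def split_sip_message(message: str):
    """Split a SIP message into (start_line, headers, body)."""
    # On the wire, lines end with CRLF and an empty line separates headers from body.
    text = message.replace("\r\n", "\n")
    head, _, body = text.partition("\n\n")
    start_line, *header_lines = head.split("\n")
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return start_line, headers, body


EXAMPLE = "\r\n".join([
    "INVITE sip:1002@pbx.example.com SIP/2.0",
    "Call-ID: 9f8e7d6c@example.com",
    "CSeq: 1 INVITE",
    "Content-Type: application/sdp",
    "",
    "v=0",
    "c=IN IP4 192.168.1.10",
    "m=audio 49170 RTP/AVP 0 8 101",
    "a=sendrecv",
    "",
])

start_line, headers, body = split_sip_message(EXAMPLE)
print(start_line)                # the SIP part
print(headers["content-type"])   # tells the receiver the body is SDP
print(body)                      # the SDP part
```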
🧾 What each one looks like structurally
| Item | Format style | Typical content | Human-readable? | Carries media? |
|---|---|---|---|---|
| SIP | Request / response message | INVITE, 200 OK, BYE, headers, routing info | Yes | No |
| SDP | Text session description body | codec list, port, IP, direction, ICE, fingerprint | Yes | No |
| RTP | Packet stream | timestamps, sequence numbers, SSRC, payload | Usually no, not in raw packet form | Yes |
🧠 What each one is responsible for
✅ SIP responsibilities
• locate or reach destination
• create session dialog
• ring / answer / cancel / terminate
• handle mid-call signaling like hold, transfer, update
✅ SDP responsibilities
• advertise supported codecs
• advertise media destination IP and port
• define media direction
• define media transport profile
• describe secure-media and ICE details in advanced systems
✅ RTP responsibilities
• carry actual encoded voice or video samples
• keep sequencing for real-time delivery
• provide timestamps for playback timing
• identify synchronization source
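📍 Where each one sits in the stack
• SIP: application-layer signaling, usually carried over UDP, TCP, or TLS (default ports 5060 and 5061)
• SDP: not a transport protocol at all; a text body carried inside signaling messages such as SIP
• RTP: application-layer media transport, usually carried over UDP on the ports advertised in SDP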
🎯 A simple real-world example
Imagine Alice calls Bob.
Step 1 — SIP: Alice’s phone sends an INVITE to Bob’s side. This is the “I want to start a call” part.
Step 2 — SDP: Inside that signaling exchange, Alice says: “I support Opus and PCMU. Send audio to this IP and port.”
Step 3 — SDP answer: Bob replies: “I accept PCMU. Send audio to my IP and port.”
Step 4 — RTP: Once negotiation is done, the actual audio packets start flowing. That is the real media stream.
So if the user says “the phone rang and connected, but I heard nothing,” the session part succeeded — but the media part did not.
🧩 How offer/answer makes SIP, SDP, and RTP work together
A very important deeper idea is that SDP usually works in offer/answer style.
| Phase | What happens | Example |
|---|---|---|
| Offer | One side describes what it can do | “I support Opus, PCMU, PCMA. Send to IP:port A.” |
| Answer | Other side accepts a compatible subset | “I accept PCMU. Send to IP:port B.” |
| RTP flow | Media begins using negotiated details | Audio packets start using the agreed codec/path |
That means RTP does not magically decide where to go. It follows the result of the negotiated session description.
🔢 Example: codec negotiation in plain language
Suppose Alice offers:
m=audio 49170 RTP/AVP 111 0 8
a=rtpmap:111 opus/48000/2
a=rtpmap:0 PCMU/8000
a=rtpmap:8 PCMA/8000
Bob only supports PCMU, so his answer contains:
m=audio 52000 RTP/AVP 0
a=rtpmap:0 PCMU/8000
The result is usually PCMU. Not because PCMU is “better”, but because it is the codec that both sides can use.
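The selection logic is essentially an intersection that keeps the offerer's order. A minimal sketch follows, with the hypothetical function name `common_codecs` and codec names compared case-insensitively; real endpoints also compare format parameters and apply local policy.

```python
def common_codecs(offer: dict, supported: set) -> list:
    """Return offered payload types the answerer supports, in the offer's order."""
    return [pt for pt, name in offer.items() if name.lower() in supported]


# Alice's offer: payload type -> rtpmap encoding, in the order of her m= line.
alice_offer = {111: "opus/48000/2", 0: "PCMU/8000", 8: "PCMA/8000"}
# Bob's capabilities (assumed for this example).
bob_supported = {"pcmu/8000"}

negotiated = common_codecs(alice_offer, bob_supported)
print(negotiated)  # [0]
```

An empty result would mean there is no codec in common, and the media stream cannot be set up as offered.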
🧪 RTP packet example: this is very different from SIP/SDP
SIP and SDP are text-heavy and human-readable. RTP is packet-oriented and built for real-time delivery.
RTP Header (conceptual)
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|V=2|P|X|  CC   |M|     PT      |       Sequence Number         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                           Timestamp                           |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|           Synchronization Source (SSRC) Identifier            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                       Media Payload ...                       |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Notice the difference: SIP and SDP help organize the session, while RTP is already at the per-packet media delivery level.
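Because the header is binary, reading it means unpacking bytes rather than splitting text. The sketch below decodes the 12-byte fixed header with Python's `struct` module. The function name `parse_rtp_header` and the hand-built test packet are assumptions for illustration, and CSRC entries and header extensions are ignored.

```python
import struct


def parse_rtp_header(packet: bytes) -> dict:
    """Decode the 12-byte fixed RTP header at the start of a packet."""
    if len(packet) < 12:
        raise ValueError("an RTP packet needs at least a 12-byte header")
    # Network byte order: two single bytes, 16-bit sequence, 32-bit timestamp, 32-bit SSRC.
    b0, b1, sequence, timestamp, ssrc = struct.unpack("!BBHII", packet[:12])
    return {
        "version": b0 >> 6,
        "padding": bool(b0 & 0x20),
        "extension": bool(b0 & 0x10),
        "csrc_count": b0 & 0x0F,
        "marker": bool(b1 & 0x80),
        "payload_type": b1 & 0x7F,
        "sequence": sequence,
        "timestamp": timestamp,
        "ssrc": ssrc,
    }


# Hand-built packet: version 2, payload type 0 (PCMU), 160 bytes of dummy payload.
packet = struct.pack("!BBHII", 0x80, 0, 1, 160, 0x12345678) + b"\xff" * 160
header = parse_rtp_header(packet)
print(header)
```

The payload type in the header is only a number. Its meaning (for example, 0 = PCMU) comes from the SDP that was negotiated earlier.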
🧠 Why SIP success does not guarantee media success
This is one of the most important real-world lessons.
A SIP dialog can be completely successful:
• INVITE sent
• 180 Ringing received
• 200 OK received
• ACK sent
And yet the user may still hear: no audio, one-way audio, broken hold, bad codec behavior, or WebRTC connection issues.
Why? Because signaling success is not the same thing as media success. The media path still has to be negotiated correctly and then actually work.
🌍 Common failure examples
Example 1: Call rings, answers, but no audio
SIP worked. SDP may have advertised a private or unreachable IP address. RTP went to the wrong place.
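A quick first check for this failure is whether the address in the `c=` line is private. Here is a minimal sketch using Python's standard `ipaddress` module. The function name `advertised_address_is_private` is hypothetical, and it assumes the `c=` line holds a literal unicast IP address rather than a hostname.

```python
import ipaddress


def advertised_address_is_private(sdp: str) -> bool:
    """Return True if the first c= line advertises a private IP address."""
    for raw in sdp.splitlines():
        line = raw.strip()
        if line.startswith("c=IN IP4 ") or line.startswith("c=IN IP6 "):
            address = line.split()[2]
            return ipaddress.ip_address(address).is_private
    raise ValueError("no c= line found in SDP")


print(advertised_address_is_private("v=0\nc=IN IP4 192.168.1.10\n"))  # True
print(advertised_address_is_private("v=0\nc=IN IP4 8.8.8.8\n"))       # False
```

A private address is not automatically wrong, because both endpoints may sit on the same LAN. It becomes a problem when the far end is on the other side of a NAT and nothing rewrites or relays the media.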
Example 2: One-way audio
One side’s SDP answer may contain the wrong return address, or a port that a firewall blocks. One RTP direction works, the other does not.
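When looking at a packet capture, the diagnosis often comes down to counting RTP packets in each direction. The sketch below assumes the capture has already been reduced to a list of `(source, destination)` pairs. The function name `audio_status` and the sample addresses are hypothetical.

```python
from collections import Counter


def audio_status(packets, a: str, b: str) -> str:
    """Classify media flow between endpoints a and b from (src, dst) pairs."""
    counts = Counter(packets)
    a_to_b = counts[(a, b)]
    b_to_a = counts[(b, a)]
    if a_to_b and b_to_a:
        return "two-way"
    if a_to_b or b_to_a:
        return "one-way"
    return "no audio"


# Assumed capture: only Alice's packets were seen on the wire.
capture = [("alice", "bob")] * 50
print(audio_status(capture, "alice", "bob"))  # one-way
```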
Example 3: Call connects but wrong codec is used
SIP is fine, but codec negotiation in SDP selected something unexpected, or transcoding logic changed behavior.
Example 4: Hold/resume behaves strangely
A later SDP exchange may have changed direction attributes to sendonly, recvonly, or inactive.
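The direction attributes in an offer and its answer have to be compatible, which is worth checking when hold misbehaves. The sketch below encodes the pairing rules of the SDP offer/answer model (RFC 3264). The names `ALLOWED_ANSWER`, `answer_direction_is_valid`, and `offerer_flows` are hypothetical.

```python
# For each offered direction, the directions an answer may legally use.
ALLOWED_ANSWER = {
    "sendrecv": {"sendrecv", "sendonly", "recvonly", "inactive"},
    "sendonly": {"recvonly", "inactive"},
    "recvonly": {"sendonly", "inactive"},
    "inactive": {"inactive"},
}


def answer_direction_is_valid(offer_dir: str, answer_dir: str) -> bool:
    """Check that the answer's direction is compatible with the offer's."""
    return answer_dir in ALLOWED_ANSWER[offer_dir]


def offerer_flows(offer_dir: str, answer_dir: str) -> tuple:
    """Return (offerer_sends, offerer_receives) for a direction pair."""
    sends = offer_dir in ("sendrecv", "sendonly") and answer_dir in ("sendrecv", "recvonly")
    receives = offer_dir in ("sendrecv", "recvonly") and answer_dir in ("sendrecv", "sendonly")
    return sends, receives


# A typical hold: the holding side offers sendonly, the held side answers recvonly.
print(answer_direction_is_valid("sendonly", "recvonly"))  # True
print(offerer_flows("sendonly", "recvonly"))              # (True, False)
```

If the resume re-INVITE never switches the stream back to `sendrecv`, the call stays silent in one or both directions even though every SIP transaction succeeded.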
🔐 How this changes in WebRTC
In WebRTC, the same logic still applies:
• some signaling channel starts the session logic
• SDP describes codecs, ICE, DTLS fingerprints, media sections, directions
• SRTP carries the actual secure media
The big difference is that SIP may not be present at all. WebRTC can use a custom signaling method like WebSocket, HTTP API, Socket.IO, or another application channel.
Advanced but important lesson: SIP is one common signaling protocol, but SDP is not tied only to SIP. SDP can be exchanged through other signaling systems too.
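To show how little the signaling channel matters to SDP, here is a sketch of wrapping an SDP offer in a JSON message that could be sent over a WebSocket or an HTTP API. The `type` and `sdp` fields mirror the session description object browsers use; the helper names `wrap_sdp` and `unwrap_sdp` are assumptions, and a real application would add its own routing fields.

```python
import json


def wrap_sdp(kind: str, sdp: str) -> str:
    """Serialize an SDP offer or answer as a JSON signaling message."""
    if kind not in ("offer", "answer"):
        raise ValueError("kind must be 'offer' or 'answer'")
    return json.dumps({"type": kind, "sdp": sdp})


def unwrap_sdp(message: str) -> tuple:
    """Recover (kind, sdp) from a JSON signaling message."""
    data = json.loads(message)
    return data["type"], data["sdp"]


offer_sdp = "v=0\r\nm=audio 9 UDP/TLS/RTP/SAVPF 111\r\na=rtpmap:111 opus/48000/2\r\n"
message = wrap_sdp("offer", offer_sdp)
kind, sdp = unwrap_sdp(message)
print(kind)
print(sdp == offer_sdp)
```

There is no INVITE anywhere in this exchange. The SDP text is the same kind of description, just delivered by a different messenger.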
⚠️ Common confusion points
❌ Mistake 1: “SIP carries audio”
No. SIP does not carry the audio stream. RTP or SRTP does.
❌ Mistake 2: “SDP is the same as SIP”
No. SDP is often inside SIP, but it is a different format with a different job.
❌ Mistake 3: “If SIP succeeds, media is guaranteed”
No. SIP can succeed while RTP fails because of bad SDP, NAT, firewall, blocked ports, or codec mismatch.
❌ Mistake 4: “RTP decides codecs”
No. Codec agreement is usually described in SDP first. RTP then carries the chosen codec payload.
❌ Mistake 5: “SDP is only for SIP phones”
No. WebRTC also depends heavily on SDP.
🧬 Beginner → advanced understanding ladder
Beginner level
SIP starts the call. SDP describes the media. RTP carries the voice.
Intermediate level
SIP messages such as INVITE and 200 OK often include SDP. That SDP decides codec, IP, port, and direction. Once both sides agree, RTP starts flowing on the negotiated path.
Advanced level
SIP is signaling. SDP is session/media description. RTP is real-time media transport. Many telecom failures happen because signaling succeeds while media negotiation or transport fails due to codec mismatch, NAT rewriting, incorrect advertised addresses, blocked RTP ports, wrong media direction, or secure-media negotiation issues.
Expert mindset
Never stop at “the SIP call connected”. Always ask: what was negotiated in SDP, which codec/path was actually selected, and did RTP really flow correctly in both directions?
❓ Quick FAQ
Is SIP more important than RTP?
They solve different problems. Without SIP-like signaling, sessions are hard to manage. Without RTP, there is no real-time media stream.
Can RTP exist without SIP?
Yes. RTP is a transport protocol for media. SIP is one signaling system, not the only one.
Can SDP exist without SIP?
Yes. WebRTC commonly exchanges SDP over custom signaling channels such as WebSocket or HTTP APIs.
Why do engineers inspect SDP so much?
Because codec selection, destination IP/port, direction, and secure-media details often explain why media succeeds or fails.
✅ Final takeaway
If you remember only one thing, remember this:
SIP is the conversation about the call.
SDP is the agreement about the media.
RTP is the real media stream itself.
Once you separate those three mentally, VoIP becomes much easier to design, debug, explain, and optimize. Many confusing telecom problems stop looking random and start looking traceable. 🎯