VOIP Protocols and Standards
I. Introduction
The rapid evolution of VoIP was made possible in part by the use of protocols and standards, or special
sets of rules that end points in a telecommunication connection use when they communicate. Standards
bodies are responsible for writing the rules that keep the lines of communication wide open. The goals
of standards organizations are centered primarily on developing, amending, revising and updating
standards to foster the ubiquity of a technology. In the case of VoIP, vendors, architects and developers,
researchers, telecom providers, and users rely on the expertise and experience of these organizations to
bring about successful and secure VoIP adoptions.
II. The Foundation: TCP/IP
TCP/IP (Transmission Control Protocol/Internet Protocol) is the basic communication language or
protocol of the Internet. It can also be used as a communications protocol in a private network (either
an intranet or an extranet). When you are set up with direct access to the Internet, your computer is
provided with a copy of the TCP/IP program just as every other computer that you may send messages
to or get information from also has a copy of TCP/IP. Two protocols are also necessary for VoIP service: a
signaling protocol and a speech transmission protocol.
TCP/IP provides end-to-end connectivity specifying how data should be formatted, addressed,
transmitted, routed and received at the destination. This functionality has been organized into four
abstraction layers which are used to sort all related protocols according to the scope of networking
involved. From lowest to highest, the layers are the link layer, containing communication technologies
for a single network segment (link), the internet layer, connecting independent networks, thus
establishing internetworking, the transport layer handling process-to-process communication, and
the application layer, which interfaces to the user and provides support services.
III. Signaling Protocols
Call signaling is used in Voice over IP (VoIP) systems to establish connections between endpoints, or
between an endpoint and a gatekeeper. The most commonly used VoIP signaling protocols are as
follows:
QSIG
Session Initiation Protocol
H.323
H.225.0
H.248
Media Gateway Control Protocol
Megaco
Signaling System No. 5
Signaling System No. 7
Dual-tone multi-frequency signaling
R1
R2 signaling
NBAP (Node B Application Part)
SCCP (Skinny Call Control Protocol)
Jingle
Q.931
IV. Speech Transmission Protocols
UDP (User datagram protocol)
RTP (Real-Time Transport Protocol)
TCP (Transmission Control Protocol)
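As a rough illustration of how speech samples are framed for transmission, the sketch below packs and unpacks the 12-byte fixed RTP header that precedes each voice payload, which is then typically handed to UDP. The function names and sample values are illustrative assumptions, not part of the original text, and optional header parts (padding, extensions, CSRC lists) are left out.

```python
import struct


def pack_rtp(payload_type, sequence, timestamp, ssrc, payload, marker=False):
    """Prefix `payload` with a 12-byte fixed RTP header (version 2, no CSRCs)."""
    byte0 = 2 << 6  # version 2; padding, extension and CSRC count all zero
    byte1 = (int(marker) << 7) | (payload_type & 0x7F)
    header = struct.pack(
        "!BBHII",  # network byte order: 1 + 1 + 2 + 4 + 4 = 12 bytes
        byte0,
        byte1,
        sequence & 0xFFFF,
        timestamp & 0xFFFFFFFF,
        ssrc & 0xFFFFFFFF,
    )
    return header + payload


def unpack_rtp(packet):
    """Split a packet built by pack_rtp back into its header fields."""
    byte0, byte1, sequence, timestamp, ssrc = struct.unpack("!BBHII", packet[:12])
    return {
        "version": byte0 >> 6,
        "marker": bool(byte1 >> 7),
        "payload_type": byte1 & 0x7F,
        "sequence": sequence,
        "timestamp": timestamp,
        "ssrc": ssrc,
        "payload": packet[12:],
    }


# Hypothetical 20 ms frame of 8 kHz audio: 160 one-byte samples.
packet = pack_rtp(payload_type=0, sequence=1, timestamp=160,
                  ssrc=0x1234, payload=b"\x00" * 160)
```

The sequence number lets a receiver detect loss and reordering, and the timestamp lets it reconstruct playout timing; neither is provided by UDP itself.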
ITU-T Standard
H.323
This is the ITU-T (International Telecommunication Union) standard that vendors should comply with when
providing Voice over IP service. This recommendation provides the technical requirements for voice
communication over LANs while assuming that no Quality of Service (QoS) is being provided by LANs. It
was originally developed for multimedia conferencing on LANs, but was later extended to cover Voice
over IP. The first version was released in 1996 while the second version of H.323 came into effect in
January 1998. The standard encompasses both point to point communications and multipoint
conferences. The products and applications of different vendors can interoperate if they abide by the
H.323 specification.
Components of H.323
H.323 defines four logical components viz., Terminals, Gateways, Gatekeepers and Multipoint Control
Units (MCUs). Terminals, gateways and MCUs are known as endpoints.
1. Terminals
These are the LAN client endpoints that provide real-time, two-way communications. All H.323 terminals
have to support H.245, Q.931, Registration Admission Status (RAS) and Real Time Transport Protocol
(RTP). H.245 is used for negotiating channel usage, Q.931 is required for call signaling and
setting up the call, RTP is the real time transport protocol that carries voice packets while RAS is used for
interacting with the gatekeeper. These protocols are discussed later in the paper. H.323
terminals may also include T.120 data conferencing protocols, video codecs and support for MCUs. An
H.323 terminal can communicate with another H.323 terminal, an H.323 gateway or an MCU.
2. Gateways
An H.323 gateway is an endpoint on the network which provides real-time, two-way
communications between H.323 terminals on the IP network and other ITU terminals on a
circuit-switched network, or to another H.323 gateway. Gateways act as "translators", i.e. they
perform the translation between different transmission formats, e.g. from H.225 to H.221. They are also
capable of translating between audio and video codecs. The gateway is the interface between the PSTN
and the Internet: it takes voice from the circuit-switched PSTN and places it on the public Internet and vice
versa. Gateways are optional in that terminals in a single LAN can communicate with each other directly.
When the terminals on a network need to communicate with an endpoint in some other network,
they communicate via gateways using the H.245 and Q.931 protocols.
3. Gatekeepers
The gatekeeper is the most vital component of the H.323 system and performs the duties of a "manager". It acts as
the central point for all calls within its zone (a zone is the aggregation of the gatekeeper and the
endpoints registered with it) and provides services to the registered endpoints. Some of the
functions that gatekeepers provide are listed below:
Address Translation: Translation of an alias address to the transport address. This is done using the
translation table which is updated using the Registration messages.
Admissions Control: Gatekeepers can either grant or deny access based on call authorization, source and
destination addresses or some other criteria.
Call Signaling: The Gatekeeper may choose to complete the call signaling with the
endpoints and may process the call signaling itself. Alternatively, the Gatekeeper may direct the
endpoints to connect the Call Signaling Channel directly to each other.
Call Authorization: The Gatekeeper may reject calls from a terminal due to authorization failure through
the use of H.225 signaling. The reasons for rejection could be restricted access during some time periods
or restricted access to/from particular terminals or Gateways.
Bandwidth Management: Control of the number of H.323 terminals permitted simultaneous access to
the network. Through the use of H.225 signaling, the Gatekeeper may reject calls from a terminal due to
bandwidth limitations.
Call Management: The gatekeeper may maintain a list of ongoing H.323 calls. This information may be
necessary to indicate that a called terminal is busy, and to provide information for the Bandwidth
Management function.
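The gatekeeper functions listed above can be sketched as a small in-memory model. This is a toy illustration under stated assumptions, not H.323 itself: the class name, method names, addresses and the single bandwidth budget per zone are all hypothetical, and real gatekeepers exchange RAS messages rather than method calls.

```python
from dataclasses import dataclass, field


@dataclass
class Gatekeeper:
    """Toy gatekeeper: address translation plus bandwidth-based admission."""
    total_bandwidth_kbps: int
    translation_table: dict = field(default_factory=dict)  # alias -> transport address
    active_calls: dict = field(default_factory=dict)       # call id -> reserved kbps

    def register(self, alias, transport_address):
        # A registration message updates the translation table.
        self.translation_table[alias] = transport_address

    def translate(self, alias):
        # Address translation: alias -> transport address (None if unknown).
        return self.translation_table.get(alias)

    def admit(self, call_id, caller, callee, kbps):
        # Admission control: both parties must be registered in the zone
        # and the zone must still have enough bandwidth for the call.
        if caller not in self.translation_table or callee not in self.translation_table:
            return False
        used = sum(self.active_calls.values())
        if used + kbps > self.total_bandwidth_kbps:
            return False
        self.active_calls[call_id] = kbps  # call management: track ongoing calls
        return True

    def release(self, call_id):
        # Free the bandwidth held by a finished call.
        self.active_calls.pop(call_id, None)


gk = Gatekeeper(total_bandwidth_kbps=128)
gk.register("alice", ("10.0.0.5", 1720))
gk.register("bob", ("10.0.0.9", 1720))
```

Keeping the list of ongoing calls in one place is what lets the same component answer both "is this terminal busy?" and "is there bandwidth left?".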
4. Multipoint Control Units (MCU)
The MCU is an endpoint on the network that provides the capability for three or more terminals
and gateways to participate in a multipoint conference. The MCU consists of a mandatory
Multipoint Controller (MC) and optional Multipoint Processors (MP). The MC determines the
common capabilities of the terminals by using H.245 but it does not perform the multiplexing of
audio, video and data. The multiplexing of media streams is handled by the MP under the
control of the MC.
H.323 Protocol Stack
The following figure [Fig 2] shows the H.323 protocol stack. The audio, video and registration packets
use the unreliable User Datagram Protocol (UDP) while the data and control application packets use the
reliable Transmission Control Protocol (TCP) as the transport protocol. Except for the T.120 protocol, the
other protocols are described in the paper. The T.120 protocol is used for defining the data conferencing
part.
IETF Standard
SESSION INITIATION PROTOCOL (SIP)
This is the IETF's standard for establishing VOIP connections. It is an application layer control protocol
for creating, modifying and terminating sessions with one or more participants. The architecture of SIP is
similar to that of HTTP (client-server protocol). Requests are generated by the client and sent to the
server. The server processes the requests and then sends a response to the client. A request and the
responses for that request make a transaction. SIP has INVITE and ACK messages which define the
process of opening a reliable channel over which call control messages may be passed. SIP makes
minimal assumptions about the underlying transport protocol. This protocol itself provides reliability
and does not depend on TCP for reliability. SIP depends on the Session Description Protocol (SDP) for
carrying out the negotiation for codec identification. SIP supports session descriptions that allow
participants to agree on a set of compatible media types. It also supports user mobility by
proxying and redirecting requests to the user's current location. The services that SIP provides
include:
User Location: determination of the end system to be used for communication
Call Setup: ringing and establishing call parameters at both called and calling party
User Availability: determination of the willingness of the called party to engage in
communications
User Capabilities: determination of the media and media parameters to be used
Call Handling: the transfer and termination of calls
Components of SIP
The SIP System consists of two components:
User Agents:
A user agent is an end system acting on behalf of a user. There are two parts to it: a client and a server.
The client portion is called the User Agent Client (UAC) while the server portion is called User Agent
Server (UAS). The UAC is used to initiate a SIP request while the UAS is used to receive requests and
return responses on behalf of the user.
Network Servers:
There are 3 types of servers within a network. A registration server receives updates concerning the
current locations of users. A proxy server on receiving requests forwards them to the next-hop server,
which has more information about the location of the called party. A redirect server on receiving
requests determines the next-hop server and returns the address of the next-hop server to the client
instead of forwarding the request.
SIP Messages
SIP defines a number of request messages, which are used for communication between the client and the
SIP server. These messages are:
INVITE: for inviting a user to a call
BYE: for terminating a connection between the two end points
ACK: for reliable exchange of invitation messages
OPTIONS: for getting information about the capabilities of a call
REGISTER: gives information about the location of a user to the SIP registration server.
CANCEL: for terminating the search for a user
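Because SIP messages are plain text, similar to HTTP, a request can be assembled with ordinary string handling. The sketch below builds a simplified INVITE; the helper name, host names and Call-ID value are hypothetical, and a real SIP stack requires further headers and parameters that are omitted here.

```python
def build_sip_request(method, target_uri, from_uri, call_id, cseq, body=""):
    """Assemble a simplified text SIP request; header values are illustrative."""
    lines = [
        f"{method} {target_uri} SIP/2.0",            # request line
        "Via: SIP/2.0/UDP client.example.com:5060",  # hypothetical sending host
        f"From: <{from_uri}>",
        f"To: <{target_uri}>",
        f"Call-ID: {call_id}",                       # identifies the call
        f"CSeq: {cseq} {method}",                    # orders requests within a call
        f"Content-Length: {len(body.encode('utf-8'))}",
    ]
    # Headers are separated by CRLF and end with an empty line before the body.
    return "\r\n".join(lines) + "\r\n\r\n" + body


invite = build_sip_request(
    "INVITE", "sip:bob@example.com", "sip:alice@example.com",
    call_id="a84b4c76e66710", cseq=1,
)
```

The same helper can produce a BYE or OPTIONS request by changing the method argument, which reflects how little the message framing differs between methods.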
Overview of SIP operation
Callers and callees are identified by SIP addresses. When making a SIP call, a caller first needs to locate
the appropriate server and send it a request. The caller can either directly reach the callee or indirectly
through the redirect servers. The Call ID field in the SIP message header uniquely identifies the calls.
Below I briefly discuss how the protocol performs its operations.
SIP Addressing
The SIP hosts are identified by a SIP URL which is of the form sip:username@host. A SIP address can
either designate an individual or a whole group.
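A minimal sketch of splitting an address of the form sip:username@host into its parts follows. It handles only that basic form; the function name is hypothetical, and ports, parameters and other URL components are deliberately not handled.

```python
def parse_sip_url(url):
    """Split 'sip:username@host' into (username, host)."""
    scheme, sep, rest = url.partition(":")
    if not sep or scheme.lower() != "sip":
        raise ValueError(f"not a SIP URL: {url!r}")
    username, at, host = rest.partition("@")
    if not at or not username or not host:
        raise ValueError(f"expected sip:username@host, got {url!r}")
    return username, host


user, host = parse_sip_url("sip:arora.32@osu.edu")
```

The host part is what a client resolves to find a SIP server, while the username part is interpreted by that server.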
Locating a SIP server
The client can either send the request to a SIP proxy server or it can send it directly to the IP address and
port corresponding to the Request URI (the Uniform Resource Identifier carried in the request).
SIP Transaction
Once the host part of the Request URI has been resolved to a SIP server, the client can send requests to
that server. A request together with the responses triggered by that request make up a SIP transaction.
The requests can be sent through reliable TCP or through unreliable UDP.
SIP Invitation
A successful SIP invitation consists of two requests: an INVITE followed by an ACK. The INVITE request asks
the callee to join a particular conference or establish a two party conversation. After the callee has
agreed to participate in the call, the caller confirms that it has received that response by sending an ACK
request. The INVITE request contains a session description that provides the called party with enough
information to join the session. If the callee wishes to accept the call, it responds to the invitation by
returning a similar session description.
Locating a User
A callee may keep changing its position with time. These locations can be dynamically registered with
the SIP server. When the SIP server is queried about the location of a callee, it returns a list of possible
locations. A Location Server in the SIP system actually generates the list and passes it to the SIP server.
Changing an Existing Session
Sometimes we may need to change the parameters of an existing session. This is done by re-issuing the
INVITE message using the same Call ID but a new body to convey the new information.
Sample SIP Operation
Here a basic example of a SIP operation is given where a client is inviting a participant for a call. A SIP
client creates an INVITE message for arora.32@osu.edu, which is normally sent to a proxy server. This
proxy server tries to obtain the IP address of the SIP server that handles requests for the
requested domain. The proxy server consults a Location Server to
determine this next hop server. The Location Server is a non-SIP server that stores information
about the next hop servers for different users. On
getting the IP address of the next hop server, the proxy server forwards the INVITE to the next
hop server. After the User Agent Server (UAS) has been reached, it sends a response back to the
proxy server. The proxy server in turn sends back a response to the client. The client then
confirms that it has received the response by sending an ACK. The exchange of messages is
shown in the figure below. In this case, we assumed that the client's INVITE request was
sent to a proxy server. However, if it had been sent to a redirect server, then the
redirect server would have returned the IP address of the next hop server to the client. The client would then
communicate directly with the UAS.
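The difference between the proxy and redirect cases can be sketched as two message traces. This is a simplified simulation with hypothetical names and a made-up next-hop address; it models only the order of messages described above, not real network traffic or SIP response codes beyond a single success reply.

```python
# Hypothetical location data: user -> address of the next-hop server.
LOCATION_SERVER = {"arora.32@osu.edu": "192.0.2.10"}


def uas_handle(request):
    # The User Agent Server answers the INVITE on the callee's behalf.
    return "200 OK"


def call_via_proxy(user):
    """Proxy case: the proxy forwards the INVITE and relays the response."""
    next_hop = LOCATION_SERVER[user]  # proxy consults the Location Server
    trace = ["client -> proxy: INVITE", f"proxy -> {next_hop}: INVITE"]
    response = uas_handle("INVITE")
    trace.append(f"{next_hop} -> proxy: {response}")
    trace.append(f"proxy -> client: {response}")
    trace.append("client -> proxy: ACK")  # client confirms the response
    return trace


def call_via_redirect(user):
    """Redirect case: the client learns the next hop and contacts it itself."""
    next_hop = LOCATION_SERVER[user]
    trace = ["client -> redirect: INVITE",
             f"redirect -> client: contact {next_hop}",
             f"client -> {next_hop}: INVITE"]
    response = uas_handle("INVITE")
    trace.append(f"{next_hop} -> client: {response}")
    trace.append(f"client -> {next_hop}: ACK")
    return trace


proxy_trace = call_via_proxy("arora.32@osu.edu")
redirect_trace = call_via_redirect("arora.32@osu.edu")
```

In the proxy trace the client never learns the next-hop address, whereas in the redirect trace every message after the redirect goes straight to it.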
COMPARISON OF H.323 WITH SIP
The proponents of SIP claim that since H.323 was designed with ATM and ISDN signaling in mind,
H.323 is not well suited for controlling voice over IP systems. They say that H.323 is inherently
complex, has overheads and is thus inefficient for VOIP. They also claim that H.323 lacks the extensibility
required of a signaling protocol for VOIP. As SIP has been designed with the Internet in mind,
it avoids both the complexity and extensibility pitfalls. SIP reuses most of the header fields, encoding
rules, error codes and authentication mechanisms of HTTP. H.323 defines hundreds of elements while
SIP has only 37 headers, each with a small number of values and parameters. H.323 uses a binary
representation for its messages, which is based on ASN.1 while SIP encodes its messages as text, similar
to HTTP. H.323 is not very scalable as it was designed for use on a single LAN and so has some problems
in scaling though newer versions have suggested techniques to get around the problem. H.323 is still
limited when performing loop detection in complex multi-domain searches. It can be done statefully by
storing messages but this technique is not very scalable. On the other hand, SIP uses a loop detection
method by checking the history of the message in the header fields, which can be done in a stateless
manner. The advantage of SIP is that it is backed by the IETF, one of the most important standards bodies,
while the advantage of H.323 is that it currently has a much larger share of the market.
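Stateless loop detection from message history can be sketched as follows, assuming the history is a list of server addresses that each hop prepends to the request (in SIP this role is played by the Via header fields). The function name and addresses are hypothetical, and real implementations compare more than a bare address.

```python
def forward(via_history, my_address):
    """Return the updated history, or None if this server sees a loop.

    The server keeps no per-request state: everything it needs to
    decide is carried in the message itself.
    """
    if my_address in via_history:
        return None  # this server already handled the request: loop detected
    return [my_address] + via_history  # record this hop, newest first


history = forward([], "proxy1.example.com")
history = forward(history, "proxy2.example.com")
looped = forward(history, "proxy1.example.com")
```

Because the check reads only the incoming message, it costs the same regardless of how many requests the server is handling, which is the scalability point made above.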
V. SUPPORTING PROTOCOLS
SIP works in conjunction with RSVP (Resource Reservation Protocol), RTP/RTCP (Real-time Transport
Protocol), RTSP (Real-time Streaming Protocol), SAP (Session Announcement Protocol) and SDP (Session
Description Protocol). RTP/RTCP is used for transporting real time data, RSVP for reserving resources,
RTSP for controlled delivery of streams, SAP for advertising multimedia sessions and SDP for describing
multimedia sessions. H.323 too works in conjunction with RTP and RTCP (RTP Control Protocol).
Present-day voice gateways usually consist of two parts: the signaling gateway and the media
gateway. The signaling gateway communicates with the media gateway using MGCP (Media Gateway
Control Protocol). MGCP can interoperate with both SIP and H.323.
Media Gateway Control Protocol (MGCP)
MGCP is a protocol that defines communication between call control elements (Call Agents) and telephony
gateways. Call Agents are also known as Media Gateway Controllers. It is a control protocol, allowing a
central coordinator to monitor events in IP phones and gateways and instruct them to send media to
specific addresses. It resulted from the merger of the Simple Gateway Control Protocol and Internet
Protocol Device Control. The call control intelligence is located outside the gateways and is handled by
external call control elements, the Call Agents. MGCP assumes that these Call
Agents will synchronize with each other to send coherent commands to the gateways under their
control. It is a master/slave protocol, where the gateways are expected to execute commands sent by
the Call Agents. It introduces the concepts of connections and endpoints for establishing voice paths
between two participants, and the concepts of events and signals for establishing and tearing down
calls. Since the main emphasis of MGCP is simplicity and reliability, and it allows programming difficulties
to be concentrated in Call Agents, it enables service providers to develop reliable and cheap local
access systems.
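The master/slave split can be sketched with two small classes: a gateway that only executes commands, and a Call Agent that holds the call logic. The class and method names, endpoint labels and addresses are hypothetical, and the method calls stand in for MGCP commands rather than reproducing the protocol's message format.

```python
class Gateway:
    """Slave side: executes commands and holds no call-control logic."""

    def __init__(self, name):
        self.name = name
        self.connections = {}  # connection id -> (endpoint, remote media address)
        self._next_id = 1

    def create_connection(self, endpoint, remote_address):
        # Bind an endpoint to the address it should send media to.
        conn_id = self._next_id
        self._next_id += 1
        self.connections[conn_id] = (endpoint, remote_address)
        return conn_id

    def delete_connection(self, conn_id):
        # Returns True if the connection existed and was removed.
        return self.connections.pop(conn_id, None) is not None


class CallAgent:
    """Master side: decides what each gateway should do."""

    def setup_call(self, gw_a, ep_a, addr_a, gw_b, ep_b, addr_b):
        # Instruct each gateway to send media to the other side's address.
        conn_a = gw_a.create_connection(ep_a, addr_b)
        conn_b = gw_b.create_connection(ep_b, addr_a)
        return conn_a, conn_b

    def teardown_call(self, gw_a, conn_a, gw_b, conn_b):
        removed_a = gw_a.delete_connection(conn_a)
        removed_b = gw_b.delete_connection(conn_b)
        return removed_a and removed_b


agent = CallAgent()
gw1, gw2 = Gateway("gw1"), Gateway("gw2")
conn1, conn2 = agent.setup_call(gw1, "line/1", "192.0.2.1:4000",
                                gw2, "line/7", "192.0.2.2:4000")
```

All decisions sit in CallAgent, so the gateways stay simple, which is the design emphasis described above.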
Real-Time Streaming Protocol (RTSP)
RTSP, the Real Time Streaming Protocol, is a client-server protocol that provides control over the
delivery of real-time media streams. It provides "VCR-style" remote control functionality for audio and
video streams, like pause, fast forward, reverse, and absolute positioning. It provides the means for
choosing delivery channels (such as UDP, multicast UDP and TCP), and delivery mechanisms based upon
RTP. RTSP establishes and controls streams of continuous audio and video media between the media
servers and the clients. A media server provides playback or recording services for the media streams
while a client requests continuous media data from the media server. RTSP acts as the "network remote
control" between the server and the client. It supports the following operations:
Retrieval of media from media server: The client can request a presentation description, and ask
the server to setup a session to send the requested data. The server can either multicast the
presentation or send it to the client using unicast.
Invitation of a media server to a conference: The media server can be invited to the conference
to play back media or to record a presentation.
Addition of media to an existing presentation: The server or the client can notify each other
about any additional media that has become available.
Resource Reservation Protocol (RSVP)
Network delay and Quality of Service are the biggest obstacles to voice-data convergence.
The most promising solution to this problem, RSVP, has been developed by the IETF. RSVP can prioritize
and guarantee latency to specific IP traffic streams. RSVP enables a packet-switched network to emulate
a more deterministic circuit switched voice network. With the advent of RSVP, VOIP has become a
reality today. With RSVP enabled, we can accomplish voice communication with tolerable delay on a
data network. RSVP requests will generally result in resources being reserved in each node along the
data path. RSVP requests resources in only one direction, therefore it treats a sender as logically distinct
from a receiver, although the same application process may act as both a sender and a receiver at the
same time. RSVP is not itself a routing protocol; it is designed to operate with current and future unicast
and multicast routing protocols. In order to efficiently accommodate large groups, dynamic group
membership, and heterogeneous receiver requirements, RSVP makes receivers responsible for
requesting a specific QoS. A QoS request from a receiver host application is passed to the local RSVP
process. The RSVP protocol then carries the request to all the nodes along the reverse data path to the
data source. RSVP has the following attributes:
It is receiver oriented
It supports both unicast and multicast
It maintains soft state in routers and hosts, providing graceful support for dynamic membership
changes
It provides transparent operation through routers that do not support it
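The soft-state attribute can be illustrated with a table whose entries lapse unless they are refreshed. This is a sketch of the idea only; the class name, the 30-second lifetime and the flow labels are assumptions, and time is passed in explicitly so the behaviour is easy to follow.

```python
class SoftStateTable:
    """Reservations that disappear unless refreshed within `lifetime` seconds."""

    def __init__(self, lifetime):
        self.lifetime = lifetime
        self.expiry = {}  # flow id -> time at which the reservation lapses

    def refresh(self, flow_id, now):
        # The initial request and every periodic refresh both land here.
        self.expiry[flow_id] = now + self.lifetime

    def purge(self, now):
        # Drop every reservation whose refresh did not arrive in time.
        expired = [flow for flow, deadline in self.expiry.items() if deadline <= now]
        for flow in expired:
            del self.expiry[flow]

    def active(self, now):
        self.purge(now)
        return sorted(self.expiry)


table = SoftStateTable(lifetime=30)
table.refresh("flow-a", now=0)
table.refresh("flow-b", now=0)
table.refresh("flow-a", now=20)  # only flow-a keeps refreshing
```

A receiver that leaves the group simply stops refreshing, and its reservation times out without any explicit teardown message, which is what makes membership changes graceful.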
Session Description Protocol (SDP)
SDP is intended for describing multimedia sessions for the purpose of session announcement, session
invitation etc. The purpose of SDP is to convey information about media streams in multimedia sessions
to allow the recipients of a session description to participate in the session. SDP includes the following
information:
Session name and purpose
Address and port number
Start and stop times
Information to receive those media
Information about the bandwidth to be used by the conference
Contact information for the person responsible for the session
The above information is conveyed in a simple textual format. When a call is set up using SIP, the INVITE
message contains an SDP body describing the session parameters acceptable to the calling party. The
response from the callee includes an SDP body describing the capabilities of the callee. In general, SDP
must convey enough information to be able to join a session and to announce the resources to be used
to non-participants that may need to know. The media information that SDP conveys is: type of media
(audio or video), transport protocol (RTP, UDP etc.) and media format.
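The simple textual format can be sketched by building a minimal session description for a single audio stream and reading back its media line. The helper names, addresses, port and payload type numbers are illustrative assumptions, and many optional SDP fields are left out.

```python
def build_sdp(session_name, username, address, port, payload_types):
    """Return a minimal SDP description for one RTP audio stream."""
    formats = " ".join(str(pt) for pt in payload_types)
    lines = [
        "v=0",                                 # protocol version
        f"o={username} 0 0 IN IP4 {address}",  # origin (ids fixed at 0 here)
        f"s={session_name}",                   # session name
        f"c=IN IP4 {address}",                 # connection address
        "t=0 0",                               # start and stop times
        f"m=audio {port} RTP/AVP {formats}",   # media, port, transport, formats
    ]
    return "\r\n".join(lines) + "\r\n"


def parse_media_line(sdp):
    """Extract (media, port, transport, formats) from the first m= line."""
    for line in sdp.splitlines():
        if line.startswith("m="):
            media, port, transport, *formats = line[2:].split()
            return media, int(port), transport, [int(f) for f in formats]
    return None


offer = build_sdp("call", "alice", "192.0.2.1", 49170, [0, 8])
media = parse_media_line(offer)
```

Each line is a single-letter field name followed by "=" and a value, so a receiver can pick out the media type, transport and format without a full parser.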
Session Announcement Protocol (SAP)
This protocol is used for advertising the multicast conferences and other multicast sessions. A SAP
announcer periodically multicasts an announcement packet to a well-known multicast address and port
(port number 9875). A SAP listener learns of the multicast scopes using the Multicast Scope Zone
Announcement Protocol and listens on the well known SAP address and port for those scopes. There is
no rendezvous mechanism: the SAP announcer is not aware of the presence or absence of any SAP
listeners. A SAP announcement is multicast with the same scope as the session it is announcing,
ensuring that the recipients of the announcement can also be potential recipients of the session being
advertised. If a session uses addresses in multiple administrative scope ranges, it is necessary for the
announcer to send identical copies of the announcement to each administrative scope range. Multiple
announcers may announce a single session, as an aid to robustness in the face of packet loss and failure
of one or more announcers. The time period between repetitions of an announcement is chosen such
that the total bandwidth used by all announcements on a single SAP group remains below a
preconfigured limit. Each announcer is expected to listen to other announcements in order to determine
the total number of sessions being announced on a particular group. SAP is intended to announce the
existence of long-lived wide area multicast sessions and involves a large startup delay before a
complete set of announcements is heard by a listener. In order to reduce the delays inherent in SAP, it is
recommended that proxy caches be deployed. A SAP proxy is expected to listen to all SAP groups in its
scope and maintain an up-to-date list of all announced sessions along with the time each announcement
was last received. SAP also contains mechanisms for ensuring integrity of session announcements, for
authenticating the origin of an announcement and for encrypting such announcements.
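The way the repetition interval follows from the bandwidth limit can be sketched as below. The sketch assumes the interval is the time needed to send every announcement once at the configured limit, with a minimum interval applied; the function name, the 300-second floor and the sample figures are assumptions for illustration rather than values taken from the text.

```python
def sap_interval_seconds(num_announcements, announcement_bytes, limit_bps,
                         floor_seconds=300):
    """Seconds between repeats so all announcements together stay under limit_bps."""
    # Bits needed to send every announcement once, divided by the allowed rate.
    seconds_per_round = (8 * num_announcements * announcement_bytes) / limit_bps
    # Never repeat faster than the assumed floor.
    return max(floor_seconds, seconds_per_round)


quiet_group = sap_interval_seconds(10, 500, limit_bps=4000)
busy_group = sap_interval_seconds(1000, 500, limit_bps=4000)
```

As more sessions are announced on a group, each announcer backs off, which is why announcers need to listen to the others in order to count them.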
HARDWARE STANDARDS
Some hardware standards for computer telephony have come up over the past few years. They
attempt to provide interoperability among the telephony products from different vendors. Two
of these standards (SCBus and S.100) are discussed below:
SCBUS
The SCBus is a high speed digital TDM (Time Division Multiplexing) bus developed for computer
telephony. It is a standalone component of SCSA (Signal Computing System Architecture) that
makes it easier to build more scalable systems using devices from multiple vendors. It provides
tight integration of hardware resources from different vendors. The features provided by SCBus
include [SCSA]:
It is based on a single distributed switching model
It provides clock management for real-time communications
It allows developers to build large distributed systems
It supports 16 synchronous serial data lines for real-time communication between
devices in a single node
The SCBus standard has been endorsed by American National Standards Institute (ANSI) and
telephony products from several vendors are based on it.
S.100
S.100 is a standard API (Application Programming Interface) for computer telephony. It
provides an effective way to develop computer telephony applications in an open environment.
S.100 is based on a client-server model and the client applications use a collection of services to
allocate, configure, and operate hardware resources. The implementation details of call
processing hardware and switch fabrics are hidden by S.100 so as to allow portable applications
to be written. The services provided by S.100 can be mainly categorized under the following
heads:
Session/Event Management: Session/Event Management is the collection of services
that allow a client to authenticate itself to an S.100 server and to manage
message communication between the client and the server. It provides a logical channel
between the client application and the server, and an associated event queue through
which the application can receive events from the server.
Group Management: The function provided by it allows a group to be treated as a single
entity by the application. It configures the group, keeps track of the resources owned by
the group and the session that owns the group.
Resource Allocation and Management: In order for a group to actually perform an
operation, all of the hardware components required by the operation must be allocated
to the application and properly configured. The resource management service takes
care of this issue.
Run Time Control: It is a mechanism provided by the S.100 server which allows a group
resource currently performing an operation to modify that operation as the result of a