Recovery Behavior of Communication Manager From Control Network Outages
PAPER
May 2007
avaya.com
Section 1: Introduction
Communication Manager, like all IP PBXs, is dependent on the reliability of the IP network. The impact an
IP network outage has on connected and transient calls depends on the extent and duration of IP network
outages. Communication Manager is able to preserve stable connected calls and connect transient calls during
brief control network outages. Communication Manager is unable to preserve connected calls and connect
transient calls during extended control network outages. The impact of brief control network outages is
discussed in this document.
Avaya encourages its customers to include redundancy within their IP network architecture to maximize
Communication Manager’s ability to maintain telephony service via alternate control paths. Avaya disclaims
any liability for any damages resulting from any IP network outages and the loss of telephony service.
This document describes the recovery behavior of Communication Manager when control network outages
occur. It provides sample scenarios and describes the behavior expected. This document does not guarantee
certain behavior with every type of call and does not provide an exhaustive list of feature interactions.
outage plus the TCP recovery time. Therefore, at the maximum IPSI Socket Sanity Timeout setting of 15
seconds, this feature accommodates only short network outages of up to approximately 7 seconds.
In this document, the term “Control Network outage” is defined as the total outage seen by the application,
that is, the actual network outage plus the TCP recovery time.
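
The definition above is simple addition, and a short worked example may make the 15-second and 7-second figures easier to relate. The sketch below is illustrative only: the function and variable names are hypothetical, and the 8-second TCP recovery allowance is inferred by subtracting the two figures quoted in the text, not stated by the source.

```python
def control_network_outage(network_outage_s, tcp_recovery_s):
    """Total outage seen by the application, in seconds."""
    return network_outage_s + tcp_recovery_s


MAX_SANITY_TIMEOUT_S = 15   # maximum IPSI Socket Sanity Timeout quoted in the text
MAX_NETWORK_OUTAGE_S = 7    # approximate tolerable network outage quoted in the text

# Inferred by subtraction; the source does not state this value directly.
implied_tcp_recovery_s = MAX_SANITY_TIMEOUT_S - MAX_NETWORK_OUTAGE_S

total_s = control_network_outage(MAX_NETWORK_OUTAGE_S, implied_tcp_recovery_s)
print(implied_tcp_recovery_s, total_s)
```

Under these assumptions, a network outage of about 7 seconds already consumes the whole 15-second timeout once TCP recovery is added.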
The server exchanges heartbeats with the IPSIs every second. An IPSI sanity failure occurs when a heartbeat is
missed and no other data has been received from the IPSI during the last second. During a Control Network
outage, the server and the IPSIs buffer all downstream and upstream messages in queues. If the socket
communication is restored before the IPSI Socket Sanity Timeout is reached, the socket communication
resumes and all queued messages are sent. This recovery is represented by Region A in Figure 2.
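
The heartbeat and buffering behavior described above can be sketched as a small model that advances in one-second ticks. This is a minimal illustration, not Communication Manager's implementation: the class `IpsiLink`, its methods, and the single shared queue are hypothetical, and the sketch only covers what happens up to the point where the timeout is reached.

```python
from collections import deque


class IpsiLink:
    """Hypothetical model of one server-to-IPSI socket, advanced in 1-second ticks."""

    def __init__(self, sanity_timeout_s):
        self.sanity_timeout_s = sanity_timeout_s
        self.failed_seconds = 0   # consecutive seconds with a sanity failure
        self.queue = deque()      # messages buffered during the outage
        self.delivered = []       # messages that reached the far end

    def send(self, message):
        # Deliver immediately while the link is healthy, otherwise buffer.
        if self.failed_seconds == 0:
            self.delivered.append(message)
        else:
            self.queue.append(message)

    def tick(self, heartbeat_seen, other_data_seen=False):
        """Advance one second. Returns True while the timeout has not been reached."""
        if heartbeat_seen or other_data_seen:
            # Communication restored in time: send everything that was queued.
            self.failed_seconds = 0
            self.delivered.extend(self.queue)
            self.queue.clear()
            return True
        # Heartbeat missed and no other data in the last second: sanity failure.
        self.failed_seconds += 1
        if self.failed_seconds >= self.sanity_timeout_s:
            # Timeout reached: recovery actions take over and buffered
            # messages are lost (recovery itself is outside this sketch).
            self.queue.clear()
            return False
        return True


link = IpsiLink(sanity_timeout_s=3)   # short timeout chosen only for the demo
link.send("setup")                    # link healthy: delivered at once
link.tick(heartbeat_seen=False)       # first sanity failure
link.send("digits")                   # buffered during the outage
link.tick(heartbeat_seen=True)        # restored before the timeout: queue is sent
print(link.delivered)
```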
If the IPSI sanity failures last longer than the IPSI Socket Sanity Timeout setting but shorter than 60 seconds,
then recovery actions are initiated, including closing and reopening the socket connection (all downstream
and upstream messages buffered in queues are lost), resetting the PKTINT (Packet Interface on the IPSI) and
performing a warm restart of the affected port network. This recovery is represented by Region B in Figure 2.
If the IPSI sanity failures last longer than 60 seconds, the affected port network goes through a cold restart.
This recovery is represented by Region C in Figure 2.
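
The three regions can be summarized as a threshold check on the duration of the IPSI sanity failures. The sketch below is illustrative: the function name and the action strings are hypothetical paraphrases of the text, and the handling of the exact boundary values (a duration equal to the timeout, or equal to 60 seconds) is an assumption, since the text does not specify them.

```python
COLD_RESTART_THRESHOLD_S = 60  # from the text: failures longer than 60 seconds

RECOVERY_ACTIONS = {
    "A": "socket communication resumes; all queued messages are sent",
    "B": "socket closed and reopened (queued messages lost); PKTINT reset; "
         "warm restart of the affected port network",
    "C": "cold restart of the affected port network",
}


def recovery_region(failure_duration_s, sanity_timeout_s):
    """Map the duration of IPSI sanity failures to Region A, B, or C."""
    if failure_duration_s < sanity_timeout_s:
        return "A"
    # Assumption: a duration exactly equal to the timeout is treated as Region B.
    if failure_duration_s < COLD_RESTART_THRESHOLD_S:
        return "B"
    # Assumption: a duration of exactly 60 seconds is treated as Region C.
    return "C"


for duration in (5, 20, 90):
    region = recovery_region(duration, sanity_timeout_s=15)
    print(duration, region, RECOVERY_ACTIONS[region])
```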
If both primary and secondary IPSI connections have concurrent network outages (most likely due to non-
diverse-path routing), the secondary IPSI connection is not viable and thus not available for interchange.
(PKTINT) is reset. This results in lost upstream and downstream messages and results in LAPD links being
reset and C-LAN socket connections being closed and reopened. Most stable calls will stay connected. Calls
in transition may be lost. See Table 2 for more details on recovery behavior in Region B.
Information on administering the IPSI Socket Sanity Timeout value is documented in the following Product
Support Notice:
PSN# PSN001217u
[1] The CM’s knowledge of the affected calls is cleared at 60 seconds, but the port network is not reset until the control network is restored and .
Table 1: Recovery Behavior in Region A
Recovery behavior when the Control Network Outage is shorter than the IPSI Socket Sanity Timeout

What happens to stable calls?
    Stable calls stay up and are not affected.

What happens to calls in transition?
    In general, transient calls will complete with a slight delay due to the network outage.

Does that mean some calls may fail?
    Yes. Some transient calls may fail. Due to the delays in the upstream and downstream messages, there is a chance of some calls not completing properly or being dropped.

What are some examples of call failures?
    As an example, calls originating from analog or DCP phones may not complete because of delays in setting up touch-tone receivers before the user presses the DTMF digits.

How about call center calls?
    Stable calls (calls that have established talk-path and are not being transitioned during the network outage)
        Stable calls stay up and are not affected. These include calls that are connected to agents or are in queue listening to announcements or music. They are not affected by the short network outage.
    How about calls in transition? (calls that are being transferred, calls that are being de-queued and being routed to an agent, …)
        Most calls will complete with a delay due to the network outage. Some calls in transition may fail due to possible race conditions of buffered messages. Some calls may be abandoned by the caller if, for example, music was removed but connection to the agent was delayed due to the network outage. The Redirection on IP Failure (RoIF) feature could minimize these race conditions if remote agents are on IP endpoints with auto-answer enabled.
    Calls in queue that arrived through trunks at the main port network and the available agent is at the main port network.
        If both the incoming trunk and the available agent are at the main (unaffected) port network, the call will be unaffected by the network outage.
    Calls in queue that arrived through trunks at the affected remote port network and/or the available agent is at the affected remote port network.
        If one or more of the parties of a call is connected through the affected port network, any transitions of the call made during the network outage may be delayed or may result in a lost or dropped call.
    How about call center agents at IP Phones or IP Agent Softphones?
        The socket connection to the phone stays up, so the phone stays registered and the agent stays logged in. The agent’s state will change to Aux Work if a call delivered during the network outage fails due to the delays [2].

CMS [3]
    CMS will bridge a 15-second network delay.

CDR [3]
    Data is buffered and delayed during the outage. If the buffer overflows, data is dropped and call records may not be complete. (Buffer is 17,326 records in CM3.1.)

AES [3]
    Messages will be delayed during the outage. The AES link has a heartbeat message between the server and AES that has a 20-second timeout. As long as this timeout is not reached, the socket stays up. However, a 15-second outage at the PCD-IPSI layer could result in a 20-second timeout at the GIP-AES layer. This should not result in any message loss because of the mechanism for reliability in the GIP-AES layer, but will cause a longer delay before normal service resumes.

[2] This behavior is triggered by the RoIF (Redirection on IP Failure) feature for Auto-answer agents.
[3] It is strongly recommended that adjuncts are co-located with the main servers.
[4] Without TTS, IP Phones on-hook at the time of the Region B outage would go into discover mode once TCP keep-alives are exhausted and the phones start to look at the alternate gatekeeper list.
Table 2: Recovery Behavior in Region B
Recovery behavior when the Control Network Outage is longer than the IPSI Socket Sanity Timeout, but less than 60 seconds

How about call center calls?
    Stable calls
        Stable calls stay up and are not affected as long as the participants in the call are on endpoints/trunks that are not affected by the outage. (See specific endpoint behavior above.) For example, a DCP to IP Phone call will stay up, but a DCP to IP Trunk call will drop because all IP Trunk calls are dropped.
    How about calls in transition? (calls that are being transferred, calls that are being de-queued and being routed to an agent, …)
        Most calls that have one or more parties connected via the affected port network will drop if the transition happens during the network outage.
    Calls in queue that arrived through trunks at the main port network and the available agent is at the main port network.
        If both the incoming trunk and the available agent are at the main (unaffected) port network, the call will be unaffected by the network outage.
    Calls in queue that arrived through trunks at the main port network and the available agent is at the remote port network.
        When the remote port network goes through a warm restart, the port network and all agents connected to the affected port network become unavailable until the warm restart has completed. Calls in queue will remain in queue or be routed to agents at the main site.
    Calls in queue that arrived through trunks at the affected remote port network.
        If the trunk stays up during the network outage, stable calls will stay up. Calls in transition during the network outage may be lost.
    Agents at DCP endpoints that are connected to the affected port network.
        Stable calls stay up. Transient calls may be lost.
    Agents at IP Phones or IP Agent Softphones that are utilizing affected port network’s resources (C-LAN or Media Processor).
        Stable calls stay up. Transient calls may be delayed or lost. Agents remain logged in for 5 minutes after the link goes down. The agent’s state will change to Aux Work if a call delivered during the network outage fails [2].

CMS [3]
    The CMS socket is closed and reopened. Data will be lost and a full pump-up may be required. It is strongly recommended that CMS is always co-located with the main servers.

CDR [3]
    Data is buffered until the link comes back up. Data is lost and some call records will be incomplete if the buffer overflows. (Buffer is 17,326 records in CM3.1.)

AES [3]
    AES supports link bounce and multiple C-LAN connections (parallel transport connections between server and AES). It tries to reconnect automatically in case of a warm reset or lost link. AES uses its own sequence numbers and acknowledgements to ensure message delivery, so no messages should be lost, but there will be a noticeable delay.
Table 3: Recovery Behavior in Region C
Recovery behavior when the Control Network Outage is longer than 60 seconds

What happens to stable calls?
    All calls connected via the affected port network are dropped. See the exception for shuffled IP calls below.

What happens to calls in transition?
    All transient calls made via the affected port network are lost.

How about shuffled IP calls?
    Shuffled IP-to-IP calls that are not using any port network resources will stay connected until the user drops the call.

How about call center calls?
    All (stable and transient) calls connected via the affected port network are dropped.

CMS [3]
    The CMS socket is closed and reopened. Data will be lost and a full pump-up may be required, depending on how long the network outage lasts. It is strongly recommended that CMS is always co-located with the main servers.

CDR [3]
    Data is buffered until the link comes back up. Data is lost and some call records will be incomplete if the buffer overflows. (Buffer is 17,326 records in CM3.1.)

AES [3]
    AES supports link bounce and multiple C-LAN connections (parallel transport connections between server and AES). It tries to reconnect automatically in case of a lost link. AES uses its own sequence numbers and acknowledgements to ensure message delivery, so no messages should be lost, but there will be a noticeable delay.
About Avaya
Avaya delivers Intelligent Communications solutions that help companies transform their businesses to
achieve marketplace advantage. More than 1 million businesses worldwide, including more than 90 percent
of the FORTUNE 500®, use Avaya solutions for IP Telephony, Unified Communications, Contact Centers and
Communications Enabled Business Processes. Avaya Global Services provides comprehensive service and
support for companies, small to large. For more information visit the Avaya Web site:
http://www.avaya.com.