Recovery Behavior of Communication Manager From Control Network Outages


WHITE PAPER

May 2007
avaya.com

Table of Contents

Section 1: Introduction

Section 2: IPSI Socket Sanity Timeout Feature

2.1 Before considering this feature
2.2 When would I consider this feature?
2.3 When would I not consider this feature?
2.4 Why only 6-7 seconds of WAN outage?

Section 3: Recovery from Control Network Outages

3.1 Server – IPSI Sockets and Heartbeats
3.2 Recovery via Alternate Control Path Available to CM
3.3 Recovery Behavior in Region A
3.4 Recovery Behavior in Region B
3.5 Recovery Behavior in Region C
3.6 How about other call scenarios and feature interactions?

Section 4: Administering the IPSI Socket Sanity Timeout

Table 1: Recovery Behavior in Region A
Table 2: Recovery Behavior in Region B
Table 3: Recovery Behavior in Region C


Section 1: Introduction
Communication Manager, like all IP PBXs, is dependent on the reliability of the IP network. The impact an
IP network outage has on connected and transient calls depends on the extent and duration of IP network
outages. Communication Manager is able to preserve stable connected calls and connect transient calls during
brief control network outages. Communication Manager is unable to preserve connected calls and connect
transient calls during extended control network outages. The impact of brief control network outages is
discussed in this document.

Avaya encourages its customers to include redundancy within their IP network architecture to maximize
Communication Manager’s ability to maintain telephony service via alternate control paths. Avaya disclaims
any liability for any damages resulting from any IP network outages and the loss of telephony service.

This document describes the recovery behavior of Communication Manager when control network outages
occur. It provides sample scenarios and describes the behavior expected. This document does not guarantee
certain behavior with every type of call and does not provide an exhaustive list of feature interactions.

Section 2: IPSI Socket Sanity Timeout Feature


Starting with Release 3.1.3, Communication Manager allows changing the IPSI Socket Sanity Timeout value from
the default 3 seconds to a value up to 15 seconds. This enables IP-remoted IPSI media gateways (Port Networks),
such as the G650, to be more tolerant of short network outages. The intent of the administrable IPSI Socket
Sanity Timeout feature is to allow more time for CM and the affected port network to recover from a network
outage before closing the IPSI socket connection, which can cause data loss and port network warm restarts.

2.1 Before considering this feature


All best current practices should be applied to ensure a robust WAN networking environment. This includes
proper configuration (QoS, speed, duplex, etc.) of all nodes in the control network, including all server
and IPSI network interfaces. This feature does not fix any network problems causing call signaling delays or
outages. Service provider SLAs should be reviewed and optimized. The use of IPSI duplication with
diverse-path network routing and/or the use of Converged Network Analyzer (CNA) should be considered.

2.2 When would I consider this feature?


This feature should only be used if there is a good understanding of the source and the nature of the network
outages being experienced. If the outages are confirmed to be of duration longer than 1-2 seconds, but shorter than
6-7 seconds, this feature should be considered. If the outages tend to be longer than this, this feature is ineffective.

2.3 When would I not consider this feature?


If the duration of WAN outages is known to be greater than 6-7 seconds, this feature should not be used and
other actions should be taken to address the outages. Also, prematurely deploying this feature into a WAN
environment where outages are not well understood is risky as it may lead to further customer dissatisfaction
should it prove to be ineffective.

2.4 Why only 6-7 seconds of WAN outage?


When there is a network outage or packet loss, the Transmission Control Protocol (TCP) initiates recovery by
retransmitting the lost or corrupt segment. This recovery action by TCP takes approximately the same duration
as the network outage. Data flows to the applications only after TCP recovers from the network outage. The
IPSI Socket Sanity Timeout is an application-level timer that times the total outage, that is, the actual network
outage plus the TCP recovery time. Therefore, at the maximum IPSI Socket Sanity Timeout setting of 15
seconds, this feature accommodates only short network outages of up to approximately 7 seconds.

In this document, the term “Control Network outage” is defined as the total outage seen by the application,
that is, the actual network outage plus the TCP recovery time.
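
The arithmetic behind the 6-7 second figure can be sketched as follows. This is a minimal illustration, not Avaya code: the function names are hypothetical, and it assumes, as stated above, that TCP recovery takes about as long as the network outage itself.

```python
def total_control_outage(network_outage_s: float) -> float:
    """Total outage seen by the application (the "Control Network outage").

    Assumption taken from Section 2.4: TCP recovery lasts roughly as long
    as the network outage, so the application sees about twice the outage.
    """
    tcp_recovery_s = network_outage_s  # approximation, not a measured value
    return network_outage_s + tcp_recovery_s


def max_network_outage(sanity_timeout_s: float) -> float:
    """Longest network outage whose total (outage + recovery) fits the timeout."""
    return sanity_timeout_s / 2.0


if __name__ == "__main__":
    for timeout_s in (3, 15):
        print(f"timeout {timeout_s} s -> network outage up to about "
              f"{max_network_outage(timeout_s)} s")
```

With the maximum setting of 15 seconds the sketch yields 7.5 seconds, which is consistent with the "approximately 7 seconds" stated above once real-world variation in TCP recovery time is allowed for.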

Section 3: Recovery from Control Network Outages

3.1 Server – IPSI Sockets and Heartbeats


The CM Server communicates with a Port Network via a TCP socket connection to the IPSI as shown in Figure 1. This
connection is critical to all communications that go through the port network. All control signals for all endpoints and
adjuncts that connect through the Port Network are multiplexed and sent via this TCP socket connection.

Figure 1: Server – Port Network Connection

The server exchanges heartbeats with the IPSIs every second. An IPSI sanity failure occurs when a heartbeat is
missed and no other data has been received from the IPSI during the last second. During a Control Network
outage, the server and the IPSIs buffer all downstream and upstream messages in queues. If the socket
communication is restored before the IPSI Socket Sanity Timeout is reached, the socket communication
resumes and all queued messages are sent. This recovery is represented by Region A in Figure 2.
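
The sanity-failure rule described above can be sketched in a few lines. This is only an illustration under stated assumptions: the per-second sample format and the function name are hypothetical and do not reflect how CM actually tracks heartbeats.

```python
from typing import Iterable, Tuple


def longest_sanity_failure_run(seconds: Iterable[Tuple[bool, bool]]) -> int:
    """Return the longest run of consecutive one-second sanity failures.

    Each item describes one second as (heartbeat_received, other_data_received).
    A second counts as a sanity failure only when the heartbeat was missed
    and no other data arrived from the IPSI during that second.
    """
    longest = 0
    current = 0
    for heartbeat_received, other_data_received in seconds:
        if not heartbeat_received and not other_data_received:
            current += 1
            longest = max(longest, current)
        else:
            current = 0  # any traffic from the IPSI ends the failure run
    return longest
```

Comparing the returned run length with the configured IPSI Socket Sanity Timeout shows whether the socket would have survived the outage.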

If the IPSI sanity failures last longer than the IPSI Socket Sanity Timeout setting but shorter than 60 seconds,
then recovery actions are initiated, including closing and reopening the socket connection (all downstream
and upstream messages buffered in queues are lost), resetting the PKTINT (Packet Interface on the IPSI) and
performing a warm restart of the affected port network. This recovery is represented by Region B in Figure 2.

If the IPSI sanity failures last longer than 60 seconds, the affected port network goes through a cold restart.
This recovery is represented by Region C in Figure 2.

Figure 2: Timeline of Control Network Outage and Recovery
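
The three regions can be summarized as a simple classification on the duration of the Control Network outage. The sketch below is illustrative only: the function and constant names are hypothetical, and the handling of durations exactly equal to a boundary is an assumption, since the text only says "shorter than" and "longer than".

```python
COLD_RESTART_THRESHOLD_S = 60.0  # boundary between Region B and Region C


def classify_region(control_outage_s: float, sanity_timeout_s: float = 3.0) -> str:
    """Map a Control Network outage duration to Region A, B, or C.

    Assumption: an outage exactly equal to a boundary value is placed in
    the more severe region; the source text does not specify this case.
    """
    if control_outage_s < sanity_timeout_s:
        return "A"  # socket resumes, queued messages are sent
    if control_outage_s < COLD_RESTART_THRESHOLD_S:
        return "B"  # socket closed and reopened, PKTINT reset, warm restart
    return "C"      # cold restart of the affected port network


if __name__ == "__main__":
    for outage_s in (2, 10, 61):
        print(outage_s, classify_region(outage_s), classify_region(outage_s, 15))
```

Raising the timeout from 3 to 15 seconds moves a 10-second Control Network outage from Region B to Region A, which is the effect the administrable timeout is intended to have.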

3.2 Recovery via Alternate Control Path Available to CM


If an alternate control path to the affected port network is available and viable, interchange to the alternate
control path is made after 3 seconds of IPSI sanity failures even if the IPSI Socket Sanity Timeout is set to a
value higher than 3 seconds. The alternate control path is either:

• Secondary IPSI in the same port network, or

• Fiber connection via ATM switch or Center Stage Switch

If both primary and secondary IPSI connections have concurrent network outages (most likely due to non-
diverse-path routing), the secondary IPSI connection is not viable and thus not available for interchange.
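
The interaction between the interchange rule and the administrable timeout can be sketched as below. The names are hypothetical and the model is deliberately simplified: it only reports when the first recovery step would begin and what that step is, based on the description in this section and in Section 3.1.

```python
from typing import Tuple

INTERCHANGE_AFTER_S = 3.0  # interchange delay stated in Section 3.2


def first_recovery_step(sanity_timeout_s: float,
                        alternate_path_viable: bool) -> Tuple[float, str]:
    """Return (seconds of sanity failures, action) for the first recovery step.

    With a viable alternate control path, interchange happens after 3 seconds
    regardless of the configured timeout. Without one, recovery actions start
    only when the IPSI Socket Sanity Timeout is exceeded.
    """
    if alternate_path_viable:
        return INTERCHANGE_AFTER_S, "interchange to alternate control path"
    return sanity_timeout_s, "close socket and warm-restart port network"


def secondary_ipsi_viable(secondary_has_outage: bool) -> bool:
    """A secondary IPSI caught in the same outage is not viable for interchange."""
    return not secondary_has_outage
```

For example, with the timeout set to 15 seconds, a port network with a viable secondary IPSI still interchanges after 3 seconds, while one whose secondary IPSI shares the failed route waits out the full 15 seconds.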

3.3 Recovery Behavior in Region A


If the Control Network outage is shorter than the IPSI Socket Sanity Timeout (Region A in Figure 2), the
upstream and downstream data that were blocked and buffered due to the network outage will resume
flowing after the TCP recovery. All connections that go through the port network are preserved. All messages
are buffered and sent with a delay due to the network outage and recovery. See Table 1 for more details on
recovery behavior in Region A. Refer to Figure 3 when reading Table 1. Note that there are two port networks
in this example. The network outage happens in the WAN. The port network at the remote site is affected by
the network outage, but the port network at the main site is not affected by the network outage.

3.4 Recovery Behavior in Region B


If the Control Network outage is longer than the IPSI Socket Sanity Timeout period but shorter than 60
seconds (Region B in Figure 2), the port network goes through a warm restart and the Packet Interface
(PKTINT) is reset. As a result, upstream and downstream messages are lost, LAPD links are reset, and
C-LAN socket connections are closed and reopened. Most stable calls will stay connected. Calls
in transition may be lost. See Table 2 for more details on recovery behavior in Region B.

3.5 Recovery Behavior in Region C


If the Control Network outage is longer than 60 seconds, the recovery of the affected port network requires
a cold restart of the port network [1]. This drops all calls going through the affected port network. Only the
shuffled IP-to-IP calls that are not using any port network resources will stay connected until the user drops
the call. See Table 3 for more details on recovery behavior in Region C.

3.6 How about other call scenarios and feature interactions?


This document is not meant to describe the behavior of each different type of call, nor to be an exhaustive list
of feature interactions. Instead, it gives examples of what happens on typical calls so that you can extrapolate
to other call scenarios and feature interactions.

Section 4: Administering the IPSI Socket Sanity Timeout


The IPSI Socket Sanity Timeout is fixed at 3 seconds in CM 3.1.2 and earlier releases. This timeout value is
administrable in CM 3.1.3 and CM 4.0 via an entry placed in the ecs.conf file.

Information on administering the IPSI Socket Sanity Timeout value is documented in the following Product
Support Notice:

PSN# PSN001217u

Figure 3: Sample Configuration with Main and Remote Locations

[1] The CM’s knowledge of the affected calls is cleared at 60 seconds, but the port network is not reset until the
control network is restored and CM gets control of the affected IPSI.


Table 1: Recovery Behavior in Region A

Recovery behavior when the Control Network Outage is shorter than the IPSI Socket Sanity Timeout

What happens to stable calls?
    Stable calls stay up and are not affected.

What happens to calls in transition?
    In general, transient calls will complete with a slight delay due to the network outage.

Does that mean some calls may fail?
    Yes. Some transient calls may fail. Due to the delays in the upstream and downstream messages, there is
    a chance of some calls not completing properly or being dropped.

What are some examples of call failures?
    As an example, there is a chance of calls originating from analog or DCP phones not completing due to
    the delays in setting up touch tone receivers before the user presses the DTMF digits.

How about call center calls?

    Stable calls (calls that have established talk-path and are not being transitioned during the network outage)
        Stable calls stay up and are not affected. These include calls that are connected to agents or are in
        queue listening to announcements or music. They are not affected by the short network outage.

    How about calls in transition? (calls that are being transferred, calls that are being de-queued and being
    routed to an agent, …)
        Most calls will complete with a delay due to the network outage. Some calls in transition may fail due
        to possible race conditions of buffered messages. Some calls may be abandoned by the caller if, for
        example, music was removed but connection to the agent was delayed due to the network outage. The
        Redirection on IP Failure (RoIF) feature could minimize these race conditions if remote agents are on
        IP endpoints with auto-answer enabled.

    Calls in queue that arrived through trunks at the main port network and the available agent is at the main
    port network.
        If both the incoming trunk and the available agent are at the main (unaffected) port network, the call
        will be unaffected by the network outage.

    Calls in queue that arrived through trunks at the affected remote port network and/or the available agent is
    at the affected remote port network.
        If one or more of the parties of a call is connected through the affected port network, any transitions
        of the call made during the network outage may be delayed or may result in a lost or dropped call.

    How about call center agents at IP Phones or IP Agent Softphones?
        The socket connection to the phone stays up, so the phone stays registered and the agent stays
        logged in. The agent’s state will change to Aux Work if a call delivered during the network outage
        fails due to the delays. [2]

CMS [3]
    CMS will bridge a 15-second network delay.

CDR [3]
    Data is buffered and delayed during the outage. If the buffer overflows, data is dropped and call records
    may not be complete. (Buffer is 17,326 records in CM3.1.)

AES [3]
    Messages will be delayed during the outage. The AES link has a heartbeat message between the server
    and AES that has a 20 second timeout. As long as this timeout is not reached the socket stays up.
    However, a 15 second outage at the PCD-IPSI layer could result in a 20 second timeout at the GIP-AES
    layer. This should not result in any message loss because of the mechanism for reliability in the GIP-AES
    layer, but will cause a longer delay before normal service resumes.

[2] This behavior is triggered by the RoIF (Redirection on IP Failure) feature for Auto-answer agents.
[3] It is strongly recommended that adjuncts are co-located with the main servers.

Table 2: Recovery Behavior in Region B

Recovery behavior when the Control Network Outage is longer than the IPSI Socket Sanity Timeout,
but less than 60 seconds

What happens to stable calls?
    Most stable calls will stay connected.

What happens to calls in transition?
    Most calls in transition will fail.

Details on specific endpoints and trunks

    Analog & DCP endpoints
        Stable calls stay connected. Calls in transition may be lost.

    PRI Trunk
        Stable calls stay up. Transient calls are lost (D-channels reset). The carrier may tear down the trunks
        due to no response from CM during the control network outage.

    IP Phone
        Stable calls stay up. Transient calls may be lost. Without TTS (Time to Service feature released in
        CM4.0), IP phones need to re-register and reestablish the socket connection with CM. [4] With TTS, IP
        phones stay registered with CM and only have to reestablish the socket connection.

    IP Softphone
        Stable calls stay up. Have to log out and log back in (this can be done automatically by the
        softphone). Neither IP Softphone nor IP Agent supports TTS, so they must re-register and reestablish
        the socket connection with CM.

    IP (H.323) Trunk
        All calls are dropped. The signaling socket for the IP trunk will be automatically reestablished after
        the control network has been restored.

How about call center calls?

    Stable calls
        Stable calls stay up and are not affected as long as the participants in the call are on
        endpoints/trunks that are not affected by the outage. (See specific endpoint behavior above.) For
        example, a DCP to IP Phone call will stay up, but a DCP to IP Trunk call will drop because all IP Trunk
        calls are dropped.

    How about calls in transition? (calls that are being transferred, calls that are being de-queued and being
    routed to an agent, …)
        Most calls that have one or more parties connected via the affected port network will drop if the
        transition happens during the network outage.

    Calls in queue that arrived through trunks at the main port network and the available agent is at the main
    port network.
        If both the incoming trunk and the available agent are at the main (unaffected) port network, the call
        will be unaffected by the network outage.

    Calls in queue that arrived through trunks at the main port network and the available agent is at the remote
    port network.
        When the remote port network goes through a warm restart, the port network and all agents connected
        to the affected port network become unavailable until the warm restart has completed. Calls in queue
        will remain in queue or be routed to agents at the main site.

    Calls in queue that arrived through trunks at the affected remote port network.
        If the trunk stays up during the network outage, stable calls will stay up. Calls in transition during
        the network outage may be lost.

    Agents at DCP endpoints that are connected to the affected port network.
        Stable calls stay up. Transient calls may be lost.

    Agents at IP Phones or IP Agent Softphones that are utilizing affected port network’s resources (C-LAN or
    Media Processor).
        Stable calls stay up. Transient calls may be delayed or lost. Agents remain logged in for 5 minutes
        after the link goes down. The agent’s state will change to Aux Work if a call delivered during the
        network outage fails. [2]

CMS [3]
    CMS socket is closed and reopened. There will be loss of data, and a full pump-up may be required. It is
    strongly recommended that CMS is always co-located with the main servers.

CDR [3]
    Data is buffered until the link comes back up. Data is lost and some call records will be incomplete if
    the buffer overflows. (Buffer is 17,326 records in CM3.1.)

AES [3]
    AES supports link bounce and multiple C-LAN connections (parallel transport connections between server
    and AES). It tries to re-connect automatically in case of a warm reset/lost link. AES uses its own sequence
    numbers and acknowledgements to ensure message delivery, so no messages should be lost, but there will
    be an obvious delay.

[2] This behavior is triggered by the RoIF (Redirection on IP Failure) feature for Auto-answer agents.
[3] It is strongly recommended that adjuncts are co-located with the main servers.
[4] Without TTS, IP Phones on-hook at the time of the Region B outage would go into discover mode once TCP
keep-alives are exhausted and the phones start to look at the alternate gatekeeper list.


Table 3: Recovery Behavior in Region C

Recovery behavior when the Control Network Outage is longer than 60 seconds

What happens to stable calls?
    All calls connected via the affected port network are dropped. See the exception for shuffled IP calls
    below.

What happens to calls in transition?
    All transient calls made via the affected port network are lost.

How about shuffled IP calls?
    Shuffled IP-to-IP calls that are not using any port network resources will stay connected until the user
    drops the call.

How about call center calls?
    All (stable and transient) calls connected via the affected port network are dropped.

CMS [3]
    CMS socket is closed and reopened. There will be loss of data, and a full pump-up may be required
    depending on how long the network outage lasts. It is strongly recommended that CMS is always co-located
    with the main servers.

CDR [3]
    Data is buffered until the link comes back up. Data is lost and some call records will be incomplete if
    the buffer overflows. (Buffer is 17,326 records in CM3.1.)

AES [3]
    AES supports link bounce and multiple C-LAN connections (parallel transport connections between server
    and AES). It tries to re-connect automatically in case of a lost link. AES uses its own sequence numbers
    and acknowledgements to ensure message delivery, so no messages should be lost, but there will be an
    obvious delay.

[3] It is strongly recommended that adjuncts are co-located with the main servers.

About Avaya
Avaya delivers Intelligent Communications solutions that help companies transform their businesses to
achieve marketplace advantage. More than 1 million businesses worldwide, including more than 90 percent
of the FORTUNE 500®, use Avaya solutions for IP Telephony, Unified Communications, Contact Centers and
Communications Enabled Business Processes. Avaya Global Services provides comprehensive service and
support for companies, small to large. For more information visit the Avaya Web site:
http://www.avaya.com.

© 2007 Avaya Inc. All Rights Reserved.

Avaya and the Avaya Logo are trademarks of Avaya Inc. and may be registered in certain jurisdictions.
All trademarks identified by ®, TM or SM are registered marks, trademarks, and service marks,
respectively, of Avaya Inc., with the exception of FORTUNE 500 which is a registered trademark of
Time Inc. All other trademarks are the property of their respective owners.
05/07 • LB3476
