Internet outages are inevitable, frequent, opaque, and expensive. To make things worse, they are poorly
understood, while a deep understanding of them is essential for strengthening the role of the Internet as the
world’s communication substrate. The importance of research on Internet outages is demonstrated by the
large body of literature focusing on this topic. Unfortunately, we have found this literature rather scattered,
since many different and equally important aspects can be investigated, and researchers typically focused
only on a subset of them. Moreover, to the best of our knowledge, no paper in the literature provides an extensive
view of this topic. To fill this gap, we analyze all its relevant facets, starting from a critical review
of the available literature. Our work sheds light on several
obscure aspects such as, for example, the different challenges considered in the literature, the techniques,
tools, and methodologies used, the contributions provided towards different goals (e.g., outage analysis and
detection, impact evaluation, risk assessment, and countermeasures), and the issues that are still open.
Moreover, it provides several innovative contributions achieved by analyzing the wide and scattered literature
on Internet outages (e.g., a characterization of the main causes of outages, a general approach for implementing
outage detection systems, and a systematic classification of definitions and metrics for network resilience).
We believe that this work represents an important and missing starting point for academia and industry to
understand and contribute to this wide and articulated research area.
1. Introduction and Motivation

[...] business operations, and many other applications require high availability and good performance of this critical, highly dynamic, extremely heteroge- [...] expensive, and poorly understood. They are in- [...] in just three weeks of monitoring, Katz-Bassett et al. [132] discovered persistent reachability problems [...] nect to data-centers cost on average about $5,000 per minute [191]. Finally, our understanding of Internet outages is severely weakened by the lack of data, information and collaboration from the network operators for which network outages represent a sensitive topic critical to their business.

Internet outages may happen for a number of reasons including (i) natural disasters, such as earthquakes and hurricanes; (ii) software attacks, such as worms or prefix hijacking; (iii) physical attacks, such as military attacks, terrorism, or electromagnetic pulse attacks; (iv) accidental misconfiguration; (v) certain forms of censorship such as the one adopted during the Arab spring; (vi) software bugs (e.g., in the routers); (vii) network equipment hardware failures, and many other factors. We thoroughly discuss the literature on network outages in relation with those causes in Section 2.1.

Although the Internet proved to be relatively robust to localized disruptions, Internet outages can still leave large sections of the population without network access for short or large time periods depending on their extent [91]. A deep understanding of Internet outages is essential for strengthening [...] Physical Systems and the Internet of Things [26]. [...] venting outages in the network is the best option to guarantee high availability to the services and applications relying on the Internet. At the same time, fast and accurate detection of an ongoing outage is the essential preliminary step to trigger effective countermeasures whose primary goal is to mitigate as much as possible the impact of the outage as perceived by the final users. All these imperative operations, however, require a deep understanding of Internet outages. Questions including “why, when, and where do these events occur? what is the expected impact? how are they likely to happen?” call for the development of theoretical and practical instruments.

The importance of this topic is demonstrated by the large body of literature focusing on Internet outages. We have found that this literature is rather scattered since Internet outages include many different and equally important aspects while researchers typically focused only on a subset of them. Previous works have surveyed this vast topic focusing on specific aspects or viewpoints, failing to provide an overall picture that could inform researchers and practitioners that are new to the topic, or want to expand their specialistic knowledge on Internet outages to other aspects, or simply look for new possible applications for their expertise. We refer to these works in the relevant sections of our survey. With this survey, we provide the research community with a comprehensive view on Internet outages discussing all the relevant aspects, offering a reference starting point for researchers willing to understand and/or contribute to this wide and articulate research area. We also consider an up-to-date list of notable Internet outages, on a time span of 17 years, testifying the variety and the frequency (almost one per year) of major disruptions of access to Internet services; Figure 2 reports the timeline of notable outages, that are discussed in Section 3.

1.1. Challenges

[...] in the following and addressed in this paper.

[...] derstanding the effects of these events when referring to roads, buildings, or human beings is quite straightforward. On the other hand, assessing their effects on the Internet infrastructure is much less obvious. Indeed, even in case of link and node failures, the routing might be able to automatically find a new stable configuration, guaranteeing good connections between any pair of nodes in the network. However, this may happen or not depending on whether the underlying physical topology does allow it, while the new stable configuration of the network may not be the optimal one. Understanding how these automatic processes take place when Internet outages arise is important to gather the essential knowledge to (i) model similar events; (ii) clarify how the network reacts in these cases; (iii) recognize ongoing Internet outages; (iv) develop effective solutions.
Figure 2: A timeline reporting concrete examples of some of the main and well known Internet outages. References [...]

• Jul 2001: A train derailed in Maryland caused fibre backbone disruptions impacting seven major U.S.A. ISPs
• Sep 2001: The 9/11 terrorist attack caused 1-2% of Internet prefixes to be unreachable
• Aug 2003: A large blackout in the U.S.A. impacted 50% of all Internet ASes
• Aug 2005: Hurricane Katrina disrupted communications for 134 networks
• Dec 2006: An earthquake in Taiwan reduced China’s Internet access capacity by 74%
• Jan 2008: Submarine cable cuts in the Mediterranean impacted more than 20 million Internet users
• Feb 2008: Pakistan Telecom hijacked YouTube’s IP address space
• Feb 2009: SuproNet shared unusual routing updates causing instability for 4.8% of all Internet prefixes
• Jan-Feb 2011: Censorship actions made Libya and Egypt disappear from the public Internet
• Oct 2012: Hurricane Sandy caused the outage rate in the U.S.A. networks to double
• Aug 2014: The globally routable prefixes overcame 512K causing old routers to strongly underperform
• Mar-Apr 2015: DDoS attack from China targets censorship [...]
• July 2015: Submarine cable cut disconnected the Commonwealth of the Northern Mariana Islands
• Oct 2016: DDoS attack targets DNS service provider Dyn, disrupting high-profile web services [...]
• Feb-Apr 2017, Oct-Dec* 2017: Cameroon state-mandated disconnection of its North West and South West areas
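As a rough check of the "almost one per year" frequency quoted for this timeline, the arithmetic can be sketched in a few lines of Python. The list of years below is transcribed from the Figure 2 timeline as it appears here; counting the 2017 Cameroon disconnections as a single item is an assumption made for this sketch, and the function name is illustrative.

```python
# Years of the notable outages listed in the Figure 2 timeline.
# Assumption: one entry per timeline item; the two 2017 Cameroon
# disconnection periods are counted as a single item.
outage_years = [2001, 2001, 2003, 2005, 2006, 2008, 2008, 2009,
                2011, 2012, 2014, 2015, 2015, 2016, 2017]


def outages_per_year(years):
    """Return (inclusive span in years, average number of outages per year)."""
    span = max(years) - min(years) + 1  # 2001..2017 inclusive
    return span, len(years) / span


span, rate = outages_per_year(outage_years)
print(f"{len(outage_years)} outages over {span} years: {rate:.2f} per year")
```

Under these assumptions the timeline covers 17 years with 15 items, i.e. roughly 0.88 notable outages per year, which is consistent with the "almost one per year" figure.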
1.1.2. How to detect Internet outages

[...] The alarms are useful to estimate the frequency, the scope, and the root causes of network outages and trigger (predefined) effective countermeasures as well.

1.1.3. How to quantify the impact of an outage

When an outage affects the infrastructure, a widely accepted set of theoretical and practical instruments is needed to quantify the caused damage. In this way, we can compare different outage episodes, rank them, and focus on the most disruptive ones to learn how to deal with them in the future.

1.1.4. How to quantify network robustness to Internet outages

[...] the vulnerability of the network to similar events is an important task also to plan new investments aiming at improving the network infrastructure.

1.1.5. How to assess the risk of disruptive Internet outages

As for any other critical infrastructure, assessing the risk of an outage of the Internet enables network operators to insure their infrastructure with private insurance companies. Unfortunately, the set of theoretical and practical instruments normally used to assess the risk of critical infrastructure cannot be applied as is to the Internet environment for several reasons. For instance, the Internet is still potentially able to perfectly deliver services in case of outages by effectively re-routing the traffic.

1.1.6. How to survive to and mitigate Internet outages

None of the networks part of the Internet is immune from outages. Even great investments in a given network do not prevent this network from being affected by outages occurring in other portions of the Internet. The obvious conclusion is that network operators have to deal with Internet outages. For this reason, we need systems and approaches to recover from outages, prevent them or mitigate as much as possible their impact on the network.

When facing these challenges, the research community must also deal with the complex operational climate of the Internet, where independent networks are forced to collaborate and compete. In this context, it is not surprising that network operators are reluctant to disclose detailed information on the managed infrastructures, thus strongly weakening our understanding of the Internet dynamics and evolution in general, and of Internet outages in particular.

1.2. Methodology and Contribution

In this paper we provide a survey on Internet outages analyzing the articulate state of the art in this field. According to the indications reported in [138], we adopt the research methodology described in the following (see Fig. 1 for an overall view).

[...] disrupting events in real-time or soft real-time (Sec. 4). Based on what we learned from these two steps, we then analyze the metrics and approaches used in literature to quantify the impact of an Internet outage (Sec. 5). Successively, we focus our attention on papers proposing approaches or metrics to assess the robustness of the network (Sec. 6) or the risk associated with these events (Sec. 7). Finally, we examine the countermeasures proposed in literature to recover from Internet outages and prevent or mitigate their consequences (Sec. 8).

• The previous steps are then used as input to derive the open issues in the field of Internet outages (Sec. 9). We provide concluding remarks in Sec. 10.

To the best of our knowledge, this is the first comprehensive survey on Internet outages available in literature. We believe that this paper provides novel and elaborated contributions of interest for the research community, achieved analysing the wide and scattered literature on Internet outages (we have considered more than 210 related works), and sheds light on the current and future research issues in this field. We have proposed several categorisations and taxonomies for the main concepts related to Internet outages, elaborated views and thorough discussions on the main aspects of this important research topic.

2. Background
With this definition, we aim at underlining a few important concepts. First of all, we want to emphasize the clear separation between the network outage (i.e. a suboptimal network condition) and its original cause (e.g., natural disasters, human errors, etc.): these two aspects are often confused in literature. As the relation between the occurrence of perturbing events (the causes) and the effect of the event on the network status (outage) is not trivial, the analysis and modeling of outages is severely limited when a clear distinction between events and status is not performed. An outage occurs every time the network deviates from the expected operational status. For instance, a network congested by legitimate traffic should not be considered as subjected to an outage since a similar event is somehow expected although undesirable. On the other hand, a network poorly performing due to the presence of malicious traffic (e.g., traffic generated by worm or network attacks) is subjected to an outage since this is a clear deviation from a normal operational status. Finally, network outages are typically well localised and their impact is often stronger in the proximity of the affected portion of the network. In the rest of the paper, we use the terms network outages, failures, or disruptions interchangeably.

In literature, classification of network outages is performed based either on the consequences of the disrupting events, or on their causes. We report in Figure 3 the classifications of outages as defined in the papers that addressed outage classification at more general level; in the figure we include our classification, that is described in detail in Section 2.1. In the following we discuss the reported classifica- [...] (PSTN) [141]. This work classifies outages based on their cause, and provides an estimation of the impact they had on customers (in terms of lack of service). Considered outages were reported by operators, according to U.S. Federal Communications Commission requirements [36], i.e. their duration was of 30 minutes or more, or potentially affected more than 30,000 customers, or affected airports, 911 Service (emergency telephone number), nuclear power plants, major military installations, and key government facilities. Implicitly this operatively defined (major) outages based on their consequences, while the subsequent classification and analysis reported in [141] is focused on causes, as derived from the brief descriptions in the reports. Notably, sometimes the classification mixes root causes with intermediate ones: e.g., the category Acts of nature has as sub-categories Cable, Power supply, and Facility (as damaged from burrowing animals or lightning), and Natural disasters, exemplified as Earthquakes, Hurricanes, or Floods. Moreover, Overloads are also reported as an outage category, although considered “somewhat problematic” due to the very definition of outage based on affected customers: in the considered outage reports the customers counted are the served ones, not the ones rejected. We highlight that, specifically with regards to overloading, the Internet differs radically from PSTN: first, the packet-switched and datagram-oriented nature of the Internet, as opposed to circuit-switched PSTN, allows for a smooth degradation of performance instead of service rejection in case of overloading; second, the Internet has been designed as a best-effort service and is used as such by residential and most small enterprise customers, with no pre-allocation of resources before accepting (or rejecting) a communication. These circumstances, and the consideration that lack of prevision or support of user demand is mostly a business-related matter, more than a technical one, led us to exclude overloading (and congestion-related issues) per se from the outages, except when they are a consequence of intentional action or other failures. We reported [141] for historical background and to introduce significant aspects (overloading, natural causes, issues in classifying outages); in the following we consider works explicitly aimed at Internet outage analy- [...]

[...] A logical link is defined as the connection between Autonomous Systems (hereafter simply AS) made possible by several physical links. The authors classify network outages in the following six categories sorted by severity: partial peering teardown (a few but not all of the physical links between two ASes fail); AS partition (internal failure breaks an AS into a few isolated parts); depeering (discontinuation of a peer-to-peer relationship); teardown of access links (failure disconnects the customer from its provider); AS failure (an AS disrupts connection with all of its neighboring ASes); regional failure (failure causes reachability problem for many ASes in a region). Another work based on the consequences of an outage is [66], where a network reliability metric is proposed, based on Mean Time To first Failure and Mean Time To Recovery of devices and links. To this aim, the authors survey different statistical models of network failures derived from empirical data, stressing the general falseness of the assumption of independence between faults. The reasons causing multiple failures are classified in

• structural: common services or components, e.g. equipment shared among providers, physical infrastructure shared among carriers;
• dynamic: failure of a component or service increases the stress on other ones;
• epistemic: a failure is not observed until another occurs, hidden e.g. because of redundancy or automatic recovery mechanisms.

Markopoulou et al. [166] provided an outage classification from the point of view of the backbone service provider. Based on IS-IS protocol messages, simultaneity, and optical-to-IP layer mappings, the authors progressively narrow down the [...]

• optical-related: including optical device failure and cable cuts;
• single-link high-frequency: including end-of-life deployments suffering from ageing, or prolonged testing/upgrade activities
• single-link low-frequency: single failures with unspecified cause

Finally, Çetinkaya et al. [58] presented a general categorization of network challenges (i.e. potential triggers of faults in networks, thus a concept wider than causes for outages). The considered categories are: large-scale disasters (e.g., earthquakes, hurricanes, pandemics), socio-political and economic challenges (e.g., terrorism, censorship), dependent failures (e.g., power shortages), human errors (e.g., misconfigurations), malicious attacks (e.g., prefix hijacking attacks), unusual but legitimate traffic (e.g., crowds looking for information on a breaking news), and environmental challenges (e.g., due to the mobility of nodes in an ad-hoc network). In their study, they analyze both spatial and temporal properties of these failures: the former are useful to quantify geographic distances that must be put between data centers so to guarantee that a certain outage is not likely to affect multiple data centers at a time, whereas the latter can be used to characterize recovery times.

Note how researchers adopted different points of view in their outage classification: Markopoulou et al. [166] and Çetinkaya et al. [58] used the causes of the Internet outages to classify them whereas Wu et al. [247] and Cholda et al. [66] focused on the consequences of the outage. If causes at the basis of the outage are ignored, any derived characterization or model is hardly generalizable to other [...]

2.2. Causes

Concrete examples of events causing Internet outages are discussed in the following, leading to [...] reported in Fig. 2.

2.2.1. Natural causes

Several studies in literature investigated the impact of natural disasters on IP networks demonstrating how earthquakes, hurricanes, and thunderstorms can cause severe network disruption. Examples are the Taiwan earthquake (2006) [9], the Japan earthquake (2011) [23], and the Katrina (2005) and Sandy (2012) hurricanes [11, 74] and [21]. Besides catastrophic natural events,
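characterization axes: origin (natural / human), intentionality (accidental / intentional), disruption type (primarily physical / pure logical)

source | class | origin | intentionality | disruption type
Çetinkaya et al. [58] | large scale disasters | - | ✗ | ✗
Çetinkaya et al. [58] | socio-political and economic challenges | ✗ | ✗ | -
Çetinkaya et al. [58] | dependent failures | - | - | -
Çetinkaya et al. [58] | human errors | ✗ | ✗ | -
Çetinkaya et al. [58] | malicious attacks | ✗ | ✗ | -
Çetinkaya et al. [58] | unusual but legitimate traffic* | ✗ | ✗ | ✗
Çetinkaya et al. [58] | environmental challenges** | - | - | ✗
Markopoulou et al. [166] | maintenance* | ✗ | ✗ | -
Markopoulou et al. [166] | router-related | - | ✗ | -
Markopoulou et al. [166] | shared optical-related | ✗ | ✗ | ✗
Markopoulou et al. [166] | multiple-links failure | - | - | -
Markopoulou et al. [166] | high failure links | - | - | ✗
Markopoulou et al. [166] | low failure links | - | - | -

✗ The class is characterized according to the specified cause type.
- The class does not distinguish inside the specified characterization axis.
* This class does not fall in our definition for outages.
** According to our definition, some cases comprised in this class are not considered outages.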
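The three characterization axes used in the table (origin, intentionality, disruption type) can be sketched as a small data structure. The following Python fragment is only an illustration: the class and field names are assumptions introduced here, not the survey's notation. The three example entries follow the examples given in the text (earthquakes, misconfigurations, military attacks), and the validation encodes the remark that intentionality applies only to events of human origin.

```python
from dataclasses import dataclass
from enum import Enum


class Origin(Enum):
    NATURAL = "natural"
    HUMAN = "human"


class Intentionality(Enum):
    ACCIDENTAL = "accidental"
    INTENTIONAL = "intentional"


class Disruption(Enum):
    PRIMARILY_PHYSICAL = "primarily physical"
    PURE_LOGICAL = "pure logical"


@dataclass(frozen=True)
class OutageCause:
    """Hypothetical record placing one cause on the three axes."""
    name: str
    origin: Origin
    intentionality: Intentionality
    disruption: Disruption

    def __post_init__(self):
        # Intentionality is applicable only to events of human origin,
        # so a natural event can never be marked as intentional.
        if (self.origin is Origin.NATURAL
                and self.intentionality is Intentionality.INTENTIONAL):
            raise ValueError(f"{self.name}: a natural event cannot be intentional")


# Examples taken from the text of the survey.
EXAMPLES = [
    OutageCause("earthquake", Origin.NATURAL,
                Intentionality.ACCIDENTAL, Disruption.PRIMARILY_PHYSICAL),
    OutageCause("misconfiguration", Origin.HUMAN,
                Intentionality.ACCIDENTAL, Disruption.PURE_LOGICAL),
    OutageCause("military attack", Origin.HUMAN,
                Intentionality.INTENTIONAL, Disruption.PRIMARILY_PHYSICAL),
]

for cause in EXAMPLES:
    print(f"{cause.name}: {cause.origin.value}, "
          f"{cause.intentionality.value}, {cause.disruption.value}")
```

Keeping the axes as separate enumerations mirrors the observation that they are mostly orthogonal: any cause is described by one value per axis, subject only to the constraint on natural events.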
also minor, yet powerful, local natural events can cause severe disruption and long lasting outages, if they affect critical parts of Internet infrastructure. As an example, the cable cut occurred [...] sea [27] (and that in 2008 made a boulder roll over and sever the same cable). Nevertheless the outage affected, with Internet, also phone/SMS, banking, airlines, and the Weather Service office, leading the Government of CNMI to declare a “State of Significant Emergency” because of the outage, that lasted 3 weeks.

Network outages can be the direct consequence of power shortages. Examples are the blackouts that occurred in Moscow during May 2005 [20] and in Northeastern United States during August 2003. The latter involved almost 50 million people and affected about 50% of all Internet ASes [73]. Blackouts may also be the consequence of other events: for instance, after the Japan earthquake, nuclear power plants were taken offline upon a request of the government, leading to a long lasting power shortage in the country. We refer the reader to [91, 94, 205] for a comprehensive insight into interdependencies among critical infrastructures.

[...] may disappear from the public Internet as the consequence of censorship measures. This form of censorship was put in action in 2011 in Libya [72] and Egypt [71]. Note that not all the censorship techniques cause network outages: more subtle techniques exist [69, 33]. The reader may refer to [144] for more insights into censorship and co-option in the Internet. Logical or physical disruption can obviously be caused by attacks. Prefix hijacking attacks [42], forcing the traffic sent to a set of destinations to be routed along different routes, cause logical disruption. An example is the prefix hijacking attack on YouTube performed by a Pakistani ISP [52]. An increasingly evident type of attack is constituted by massive Distributed Denial of Service (DDoS), often carried on by means of botnets, such as the attack in October 2016 against the DNS service provider Dyn [159], but also with infrastructures such as the Chinese “Great Cannon” [165]. Physical disruptions instead can be caused by terrorist or military attacks: for instance, about 1% of the globally announced Internet prefixes showed immediate loss of reachability during the World Trade Center attack [178, 181]. Another example is represented by the Thai anti-government protesters temporarily cutting off a large portion of Internet connectivity of their country in 2013 [157].

2.2.3. Logical failures

Device misconfiguration is another source of network outages. For instance, route aggregation in border routers can potentially cause persistent forwarding loops and traffic blackholes [146]. The incidents referred to as Google’s May 2005 outage [182] and the “China’s 18-Minute Mystery” [70] have also been ascribed to misconfiguration. Rarely, also incorrect de-peering between ISPs can cause disruptions [48]. Legacy network equipment can generate network outages as well. For instance, when the number of globally routable prefixes overcame 512K in August 2014, the Internet suffered significant outages due to old routers limiting their use of a specialized, and expensive, type of memory known as ternary content-addressable memory (TCAM) to 512K prefixes by default [221, 148]. Network outages may also arise when ASes decide to de-peer as the result of a busi- [...]

[...] For instance, earthquakes and hurricanes are natural disasters that accidentally cause physical disruptions of network routing elements while misconfigurations are human faults that accidentally cause pure logical disruption. Military attacks, instead, are performed by humans to intentionally cause physical disruptions, while maintenance activities can accidentally determine logical disruptions (while planned downtime for maintenance is not to be considered as an outage). Of course intentionality is applicable only to events of human origin.

Considering the classifications of outages described in Section 2.1, we have proposed a cause-based approach, therefore we compare with [166] and [58] providing a mapping between our characterization and the cited ones in Table 1. It can be noted that the aspects we have highlighted are mostly orthogonal to those adopted in former classifications; for the case of Markopoulou et al. this is highly evident, and reflects the fact that the authors did not intend to present a general characterization, but only to characterize a specific operational network from the observation point of the network operator itself, and their classification is scoped by the adopted detection and analysis methodology (see Section 4 for more detail). Compared with Çetinkaya et al. we focus on causes, and characterize them according to few fundamental properties, while the authors present a coarse, partially over- [...]
nature of the disruption (primarily physical ver- allows operators to troubleshoot network failures
sus pure logical) additionally affects the monitor- and poor network performance by listing the IP
ing/detection possibilities (see Section 4), and as addresses owned by the devices traversed by traffic
a consequence the strategies to cope with outages towards a destination. While Traceroute provides
change as well (see Section 6 for a more in-depth important knowledge to detect and locate network
analysis). outages, this basic diagnostic tool is also known
to be affected by several drawbacks, e.g., load
2.3. Basic detection and analysis techniques balancers [39], anonymous routers [168], hidden
Researchers working in the field of Internet out- routers [192], misleading intermediate delays [164],
ages often rely on a common set of basic techniques and third-party addresses [161]. More in general,
to perform detection and analysis. Based on our active probing is not scalable due to the imposed
experience and after the analysis we did, we clas- network overhead: for this reason, it can be
sify these techniques in the taxonomy reported in profitably used only towards a reduced set of
Fig. 5. destinations periodically probed. Due to these
Basic techniques are either directly related or limitations, active probing can mainly expose
not related to the network traffic. When adopt- large-scale, long-lasting outage events.
ing non-traffic-related basic techniques, researchers
commonly inspect (1.) non-structured data sources Differently from active probing, passive monitor-
such as technical blogs (e.g., Renesys [17]), mail- ing does not inject additional traffic into the net-
ing lists (e.g., NANOG [12] and outages [14]), work but takes advantage of the traffic-related in-
alarms raised by the final users of networks and formation already stored by the network devices or
services complaining through microblogging social by third-parties. These techniques can be further
networks; (2.) semi-structured data sources such as classified according to the type of traffic they are
device usage and error logs, customer emails, qual- related to, specifically control-plane and data plane
ity alarms, and user activity logs; (3.) structured traffic. When focusing on control-plane traffic, re-
data sources such as network trouble tickets. For searchers inspect routing-related information. In
instance, Banerjee et al. [43] recently used text min- case of inter-domain routing, an extremely valuable
ing and natural language processing to analyse the source of information on global Internet dynamics
traffic-related techniques: these techniques can such as Routeviews [19], RIPE RIS [28], PCH [15],
be divided in active probing and passive moni- and publicly accessible BGP looking glasses [137]
toring. Active probing techniques inject into the allow researchers to monitor the best routes an-
network purposely forged synthetic measurement nounced by ASes. To some extent, an Internet out-
traffic: by observing how the network treats the age can be visible in the BGP traffic since it poten-
injected traffic, researchers can infer the status tially causes several best routes to be dropped and a
of the network under investigation to potentially remarkable amount BGP update messages to be ex-
detect failures. Due to the radically distributed changed. Unfortunately, despite their numbers and
privileged locations across the Internet1 , the BGP
ecosystem of networks on which no one has full ing [108]. Only partial routing information can be
access nor control. The most used active probing gathered about those ASes far from the route col-
tools adopted by researchers working on Internet lectors available. Also, since BGP only exposes best
outages are ping and traceroute. Both tools rely routes, researchers need to aggregate BGP data
on ICMP [194]: the former estimates the round over larger periods of time to gather a more com-
trip time related to a given destination as the plete view of the logic interconnections of an AS:
time elapsed between sending an ICMP Echo this process is error prone since also stale intercon-
Request and the receiving the corresponding ICMP nections might be considered. When studying In-
Echo Reply. The lack of responses when using
ping or very high delay may uncover network 1 For instance, PCH BGP route collectors are located at
failures along the path [198, 199, 200]. Traceroute large Internet Exchange Points.
ternet outages, researchers can also take advantage can achieve a more comprehensive understanding
of the type of relations among ASes (e.g., customer-provider or peer-to-peer) inferred and made
publicly available by institutions such as CAIDA [86, 5]. Note, however, that accurately inferring
these relations is an extremely complex task, since some of the basic assumptions on which these
inferences are made do not always hold in the real Internet [103], and very large ASes spanning
different countries or continents may have different types of relation depending on the specific
geographic area. Information related to intra-domain routing may also provide useful hints on
network outages. For instance, by logging Interior Gateway Protocol (IGP) messages generated by
the network routers with tools such as Packet Design Route Explorer [18], researchers can
investigate how the network reacts to disruptive events. Unfortunately, the operational climate of
the Internet, where ASes both collaborate and compete, generally discourages the sharing of IGP
data with the research community. For this reason, only a few researchers have had the privilege
of accessing such data sources.
Finally, like the control plane, the data plane also provides useful information on network
outages. For instance, several works monitored the aggregated volumes of traffic in the network.
Indeed, when multiple network components fail, the total amount of traffic in the network may
drop significantly. Typically, the traffic volumes are compared with patterns observed during
“normal” periods in order to estimate the approximate ratio of traffic drop due to the outage.
of this single event only by reading all the available works. This result is the direct consequence
of the lack of assessed methodologies, best practices, or guidelines in the literature for analyzing
Internet outages. We consider this one of the most important open issues in this research area, as
we further discuss in Sec. 9.
In the following, we examine the main findings of the most relevant works focusing on this specific
event. Cho et al. [62] investigated the impact of the earthquake on a Japanese ISP named IIJ, mainly
relying on passive monitoring: they analysed (i) inter-domain control plane information, i.e., the
BGP traffic exchanged by the ISP with a major peer, collected using Quagga MRT [16]; (ii)
intra-domain control plane information, i.e., OSPF signalling collected via Packet Design Route
Explorer [18]; and (iii) data plane information, i.e., traffic volumes on trans-oceanic links and
residential traffic collected via SNMP. Despite the large impact of the earthquake on the
infrastructure of this ISP (trans-oceanic link failures, connectivity lost in six prefectures in the
Tohoku area), the authors noticed that over-provisioning and traffic re-routing strongly limited the
visibility of this disruption outside the ISP.
Fukuda et al. [97] investigated the earthquake impact on a different Japanese ISP named SINET4.
They also used routing information (BGP and OSPF) and traffic volumes, logging the event messages
generated by routers. The authors observed (a) a significant traffic drop in the backbone of the
ISP; (b) the loss of connectivity for several univer-
Internet outage episodes. In this Section we present and discuss these analyses, ordering them accord-
Liu et al. [154] characterized the inter-domain
the impact of this earthquake by mainly relying on data-plane measurements performed by a widely
adopted peer-to-peer system (i.e., BitTorrent), identifying the specific regions and network links
where Internet usage and connectivity were most affected. The authors used two plugins developed
for the Vuze BitTorrent client [237], named Ono [55] and NEWS [64], to anonymously collect usage
statistics as well as passive monitoring and active measurements such as Traceroutes towards a
subset of connected peers. By leveraging the view of this popular P2P system, the authors
documented an overall decrease in the usage of BitTorrent as well as routing changes in the
affected area.
Despite thousands of victims and huge destruction, all these works demonstrated that the most
powerful earthquake ever recorded to have
Other earthquakes have been investigated as well. Köpp [140] analysed BGP data to shed light on the
consequences of an earthquake hitting New Zealand in 2011. Köpp first identified the network prefixes
fixes seen at the London Internet Exchange Point (LINX), noticing only a limited impact. Both the
Japanese and New Zealand earthquakes have also been investigated in [79]. For the sake of
completeness, how earthquakes impact the Internet has also been the subject of many works focusing
on user activity on social networks. For instance, Bruns et al. [53] investigated user activity on
Twitter after the New Zealand earthquake. Similarly, Qu et al. [197] pinpointed the effect of the
2010 Yushu Earthquake on a popular Chinese microblogging site. Other similar studies can be found
in [196, 184, 92, 150, 222, 231, 155, 183, 229, 236, 211].
the public Internet which continued to achieve high reliability. The consequences of hurricane
Sandy were preliminarily investigated in [115]: Heidemann et al. used a large-scale ping-based
experimental campaign to assess whether steadily reachable destinations were no longer reachable,
likely due to the hurricane. The authors noticed that the outage rate doubled in the area hit by
the hurricane, while networks took about four days to fully recover. Aben [30] relied on DNS
reverse lookup to inspect Traceroute traces traversing the NYC and ASH areas, noticing that most of
the paths were rerouted around the areas hit by hurricane Sandy, keeping networks operational and
interconnected in spite of the difficult circumstances.
3.4. Intentional disruption
ing and passive monitoring: they exploited control-plane information such as BGP data and Regional
Internet Registries (RIRs) [1] delegation files, but also data plane information such as
unsolicited traffic to unassigned address space (commonly referred
unsolicited data plane traffic coming from these geographic areas significantly dropped. They also
argued that analysing IBR can be an effective tool for uncovering large-scale network disruption
[46]. Slater [219] used SNMP counters to analyze the problems caused by a DDoS attack against the
Internet Root Servers in October 2002, as well as the consequences of router misconfiguration.
Marczak et al. [165] analyzed in depth a DDoS attack targeted at the censorship circumvention
services offered by, an organization that aims at monitoring and countering Internet censorship in
China. From the inspection of server logs and by active measurements the researchers assessed the
Man-In-The-Middle nature of the attack and inferred several properties of the infrastructure
enforcing it (dubbed “the Chinese Great Cannon”): the most notable ones are its probabilistic
nature (impairing detection efforts) and the injection of HTML code causing unaware users to
participate in the attack.
A recent DDoS attack with huge impact and visibility was the one targeting the DNS service provider
Dyn, because it indirectly affected the many high-traffic websites using the service [159]. In the
aftermath of the event, the attack was traced back to botnets of the Mirai type, which leveraged
vulnerable IoT appliances in the order of 100,000. While the technical properties of the botnet
were not new per se, the volume of network traffic generated on the target reportedly reached
unprecedented levels of 665 Gbps.
When the blocking is intentionally enforced by a state, it is usually enforced at the ISP level
[33]. Such is the case for the already mentioned Arab Spring in 2011, and recently for Cameroon
(still ongoing at the time of writing) [29]. Such events have occurred in non-democratic countries,
in correspondence with elections or social and political unrest. Having a clear impact on freedom
of expression, this practice is often denounced by international human rights organizations [32],
backed by anecdotal references or leaked documents (blocking orders to ISPs).
3.5. Accidental causes
A class of outages of human origin that can be ascribed to mistakes or misconfiguration is related
to prefix hijacking, such as the episodes investigated in [42]. Researchers from RIPE ATLAS [52]
used RISwhois2 and BGPlay to inspect the information stored by the BGP route collectors and uncover
how Pakistan aimed at blocking the YouTube web-
prefix assigned to Google. Outages in the Domain Name System (DNS) have been analysed as well:
Pappas et al. [186, 187] identified delegation inconsistency, lame delegation, diminished server
redundancy, and cyclic zone dependency. These operational errors heavily impacted availability and
query delays. Chan et al. [59] used active probing to pinpoint the consequences of a submarine
cable fault that occurred in 2010 on end-to-end network performance. The authors employed
Traceroute and other network diagnostic tools to assess the path-quality degradation in terms of
delay and asymmetric packet losses.
Regarding more general network operational failures, Markopoulou et al. [166] analyzed IGP messages
exchanged in the Sprint backbone to characterize failures that affect IP connectivity.
Surprisingly, they noticed that about 20% of failures in the Sprint backbone occurred during
periods of scheduled maintenance activities: this large fraction of outages seems to be
self-inflicted by the network operator. Turner et al. [232] used non-traffic-related data such as
email logs, router configurations, and syslogs to analyse over five years of failures occurring in
the CENIC network, which serves most of the public education and research institutions in
California. Mahajan et al. [158] inspected three weeks of BGP data to discover that BGP
configuration errors are pervasive, with 200-1200 prefixes (0.2-1.0% of the BGP table size)
suffering from misconfiguration each day. Finally, Huang et al. [122] correlated six months of BGP
update streams in the Abilene backbone with a catalogue of known disruptions of nodes and links,
noticing the importance of simultaneously analysing multiple BGP update streams to detect most of
the important events.
3.6. Discussion on outage analyses
In this regard, it is important to underline that the adopted basic instruments suffer from severe
limitations, as we anticipated in the previous section. Indeed, BGP data are well known to provide
a heavily incomplete view of inter-domain routing [108]. Relying exclusively on the control plane
may also generate false alarms [254]. For instance, this may happen due to default routing [54],
which lets Internet traffic reach its destination normally even if the corresponding route does
not appear at the control-plane level. Data-plane measurements are not free of limitations either.
For instance, Traceroute may lead users to incorrectly reverse engineer the network path [192, 161,
162, 168, 39]. Also, the lack of replies when using ping does not necessarily imply lack of
connectivity [121]. Several concerns also exist about the accuracy of IP geolocation databases such
as Maxmind, as extensively demonstrated in different works (e.g., [193, 255]).
In conclusion, researchers investigating Internet outages should carefully consider the validity
of their conclusions in light of the limitations of the adopted tools.
the damaged network components), or in the case of logical disruptions (allowing network
administrators to quickly restore a satisfying operational status).
We characterize the detection methods in the literature based on the adopted monitoring techniques
(see Table 3) and discuss them accordingly. Moreover, we extrapolate the general approach adopted
by these studies and we propose the flowchart reported in Figure 6.
Detecting network outages typically requires four steps: (1) Data collection and preprocessing;
(2) Network outage detection; (3) Outage locating; and (4) Root cause analysis. Most systems
implement only the first two steps. These systems continuously monitor the network by collecting
data, typically with a combination of the basic techniques we introduced in Sec. 2, such as active
probing or passive monitoring. During this step, data filtering and sanitization also take place
in order to remove as much noise as possible from the data. Different algorithms are then applied
to the refined data during the second step, to detect large- and small-scale events potentially
related to Internet outages.
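As a minimal sketch of the first two steps described above (data collection with preprocessing, followed by detection), the following Python fragment sanitizes a series of (timestamp, traffic volume) samples and flags sustained drops with respect to a baseline. The function names, the median baseline, and the `drop_ratio` and `min_run` parameters are illustrative assumptions; they do not reproduce any specific system surveyed here.

```python
from statistics import median


def sanitize(samples):
    """Step 1 (preprocessing): drop missing or negative readings."""
    return [(t, v) for t, v in samples if v is not None and v >= 0]


def detect_outages(samples, baseline_len, drop_ratio=0.5, min_run=3):
    """Step 2 (detection): flag sustained drops below a baseline.

    The baseline is the median of the first `baseline_len` clean samples
    (assumed to represent a "normal" period). A run of at least `min_run`
    consecutive samples below `drop_ratio * baseline` is reported as a
    candidate outage, returned as a (start_time, end_time) pair.
    """
    clean = sanitize(samples)
    baseline = median(v for _, v in clean[:baseline_len])
    threshold = drop_ratio * baseline

    events, run = [], []
    for t, v in clean[baseline_len:]:
        if v < threshold:
            run.append(t)
            continue
        if len(run) >= min_run:
            events.append((run[0], run[-1]))
        run = []
    if len(run) >= min_run:  # series ended while still below threshold
        events.append((run[0], run[-1]))
    return events


# Hypothetical traffic volumes: a sustained drop at t=5..7, a blip at t=9.
series = [(0, 100), (1, 102), (2, None), (3, 98), (4, 101),
          (5, 40), (6, 35), (7, 30), (8, 99), (9, 20), (10, 100)]
print(detect_outages(series, baseline_len=4))
```

Requiring a minimum run length is one simple way to trade detection delay for fewer false alarms; steps (3) and (4) would need topology and routing information that this sketch does not model.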
Data Plane   Core                           [212, 78, 85]
             Edge                           [230, 110, 63]
Active       Based on Ping and Traceroute   [200, 198, 199, 223]
Detecting ongoing network outages is important to qualitatively and quantitatively understand the
approach is indeed adopted in scenarios where most part of the network is under control (e.g. for
ISPs), so root cause analysis and mitigation/resolution of the outage is more likely thanks to the
detailed
or approaches according to which specific mechanism is primarily employed during the data collec-
4.1.1. Control Plane
Public BGP repositories proved to be extremely helpful for systematic outage detection as well as
for the analysis of specific outage episodes. Publicly available BGP data such as RIBs and update
messages are systematically crawled from public repositories (e.g., Routeviews, RIPE, etc.) during
the data collection step. This data is then converted into a suitable format for the subsequent
analysis. During the detection step, a common approach is to group all the BGP messages originated
by a given event. This procedure can be affected by inaccuracy (i.e., grouping also BGP messages
not related to the outage) and incompleteness (i.e., missing part of the BGP messages related to
the outage of interest). To a certain extent, such problems are unavoidable due to the necessarily
limited visibility of the available data sets. BGP data can also be used to locate the ASes
responsible for the detected disruption. The main drawback when relying on BGP data to
systematically detect network outages is the large number of false alarms, since many legitimate
events may also cause changes of the path or of the origin prefix. We discuss in the following
several approaches
outages and evaluating their impact. The main idea is modelling the normal state of the Internet,
and then monitoring the network for a given period to separate abnormal clusters in case an outage
occurs.
Time-based change detection. BGP Eye [228] clusters BGP updates related to the same BGP event and
correlates the events across multiple border routers in order to expose anomalies. The system
attempts to identify the root cause of these anomalies. Two different perspectives are considered:
(i) an Internet Centric perspective, to track anomalies through an analysis of AS-AS interactions,
and (ii) a Home Centric perspective, to provide insight into how an AS is affected by anomalies
originated from external ASes, even several hops distant. The authors evaluated the system against
the routing outages caused by the Slammer Worm in 2003 and the prefix hijacking by a Turkey Net
(AS9121) in 2004. Deshpande et al. [81] proposed an online mechanism for the detection and the
analysis of routing instabilities potentially caused also by network outages. The detection step
is based on the analysis of time-domain characteristics of BGP update messages. More specifically,
filtering and adaptive segmentation techniques are applied to time series of feature data in order
to isolate periods of instability. BGP route changes are then also used to locate the ASes
originating the routing instabilities.
Other detection approaches. Glass et al. [105] also relied on BGP data. The authors adopted tensor
factorization for detecting events of interest, and graph-theory analysis to locate the origin
ASes. A similar approach is adopted in [56]. Principal Components Analysis [249] and machine
learning techniques [253] have also been proposed for anomaly detection in BGP data.
4.1.2. Data Plane
tion of the collection points in the network, these approaches can be grouped in core-based, i.e. ob-
and outgoing traffic flows. During the data collection process, FACT collects NetFlow records and
aggregates flows per remote host, network, or AS. The key idea behind the detection process is that
a network outage is likely to result in (i) an increasing number of unsuccessful one-way
connections to a remote destination (a network prefix, an AS, etc.), and (ii) a decreasing number
of successful two-way connections. Other researchers proposed to rely on unsolicited data plane
traffic to detect network outages [78, 85]. Glatz et al. [85] monitored unsolicited
data plane traffic towards a live network to detect and characterize fine-grained outages affecting
local networked services.
Edge-based. A completely different approach is represented by the PerfSonar project [230, 110], a
collaborative network monitoring platform providing several network troubleshooting tools. The
system is deployed in several independent research and educational networks, enabling the sharing
of data such as specific SNMP counters useful to expose network outages. Finally, Choffnes et al.
[63] proposed Crowdsourcing Event Monitoring (CEM). CEM passively monitors and correlates the
performance of end-user applications in order to expose network events including outages. The data
collection process is implemented as an extension to BitTorrent. Each end node monitors flow and
path-quality information such as throughput, loss, and latencies to locally detect an event.
Common challenges. The works cited above share the difficulty of guaranteeing user privacy.
Djatmiko et al. [87] proposed a general approach to overcome this limitation. More specifically,
they designed a distributed mechanism based on Secure Multi-Party Computation (MPC) [80] to
correlate NetFlow [7] measurements passively collected from multiple ISPs. MPC consists of a set of
cryptographic methods allowing different parties to aggregate private data without revealing
sensitive information, thus avoiding the aforementioned limitations. The authors integrated MPC in
FACT [212], allowing network operators to troubleshoot outages by solely relying on flow-level
information.
4.2. Active probing
Many other outage detection systems primarily use active probing during the data collection
process. These systems often rely on Ping and Traceroute for periodically probing a number of
destinations from several vantage points. Other approaches are based on tomography techniques,
i.e., they perform targeted end-to-end measurements based on the (possibly limited) knowledge of
the topology of the measured networks (often the whole Internet). Both types of active approaches
usually rely on distributed active measurement platforms such as Archipelago [4], Planetlab [45],
DIMES [215], etc.
4.2.1. Approaches based on Ping and Traceroute
Quan et al. proposed Trinocular [200], a system probing each IP block with ICMP echo requests
(pings) at 11-minute intervals and classifying responses in two main categories: (i) positive, in
case an ICMP reply is received; (ii) negative, in case of ICMP replies indicating that the network
is unreachable (e.g., destination unreachable), or in case of lack of responses. Negative
responses potentially expose Internet outages. The authors demonstrate that a single computer can
track outages across the (analyzable) Internet by probing a sample of 20 addresses in all 2.5M
responsive /24 address blocks, and detect 100% of the outages lasting more than 11 minutes.
The basic principles behind Trinocular were preliminarily explored in [198, 199]. The authors used
this approach to investigate macro-events (hurricane Sandy in 2012, the Japanese earthquake in
March 2011, and the Egyptian revolution in January 2011) as well as micro-events such as classic
daily network outages. They exploited this technique to evaluate the availability of the whole
Internet: according to their results, the Internet has a “2.5 nines” availability.
Schulman and Spring [223] designed and deployed a measurement system called “ThunderPing”, which
measures the connectivity of residential Internet hosts before, during, and after forecast periods
of severe weather. The data collection process implemented in ThunderPing crawls weather forecasts
to identify geographic areas affected by extreme weather conditions (e.g., thunderstorms) and
triggers ping measurements towards residential hosts located in these areas. ThunderPing
demonstrated that failures are four times as likely during thunderstorms and two times as likely
during rain compared to clear weather.
4.2.2. Tomography-based approaches
A special family of outage detection systems exploits binary tomography. Binary tomography is the
process of detecting link failures by sending coordinated end-to-end probes [89]. It only requires
network elements to perform classic packet forwarding operations. Relying on Ping and Traceroute,
instead, requires the devices to actively collaborate by providing ICMP responses. Hence, binary
tomography is particularly helpful when the networks are configured to discard ICMP messages coming
from the outside.
Cunha et al. [75] observed that binary tomography is sensitive to the quality of the input: (i)
the network topology and (ii) a set of end-to-end measurements, in the form of a reachability
matrix,
which indicates whether each path is up or down. The authors developed two methods for generating
higher quality inputs to improve the accuracy and the efficiency of binary tomography algorithms.
They observed that binary tomography algorithms cannot always be directly applied in real
networks, because they tend to generate a remarkable amount of false alarms. The authors proposed
(i) a probing method for quickly distinguishing persistent path failures from transient
congestion, as well as (ii) strategies for aggregating path failures in a consistent reachability
matrix.
Dhamdhere et al. [82] proposed a troubleshooting algorithm, called “NetDiagnoser”, based on binary
tomography. They extended this technique to improve the diagnosis accuracy in the presence of
multiple link failures. NetDiagnoser actually relies on Traceroute-like measurements, performed by
troubleshooting sensors located at end hosts inside multiple ASes, and it also considers
information on paths obtained by analyzing BGP and IGP messages after rerouting around a failure.
Network tomography is a powerful tool. However, it is not free of limitations [252]: fast
detection of network outages implies a high probing rate, which is practically infeasible in real
networks. Network dynamics may also weaken the basic assumption that the injected packets are
traversing the same links previously observed. Load balancing [40] further exacerbates this issue.
probing and passive monitoring.
Katz-Bassett et al. proposed Hubble [132], a system detecting Internet reachability problems where
routes to the destination network exist at the control plane according to public BGP data but
packets do not reach the destination network through the data plane. Data collection relies on BGP
data and on Ping and Traceroute measurements triggered by changes at the control plane. Hubble
proved able to discover 85% of the reachability problems that would be found with a pervasive
probing approach, while reducing the probing traffic by 94.5%. The authors also studied the
trade-off between sampling and accuracy, arguing that their use of multiple samples per
destination network greatly reduces the number of false conclusions about outages.
In [125], Javed et al. designed “PoiRoot”, a real-time system allowing ISPs to accurately isolate
the root causes of any path change affecting their prefixes. PoiRoot exploits BGP data but also
combines existing measurement tools (e.g., Traceroute and Reverse Traceroute [133]) to gain higher
visibility on the ongoing events.
Another system exploiting both active and passive monitoring is Argus, proposed by Xiang et al.
[248] to detect prefix hijacking. Argus relies on live BGP feeds provided by BGPmon [195] and daily
Traceroute archives made available by CAIDA Archipelago [4] and the iPlane project [156]. The key
idea behind this system is that routers polluted by a prefix hijacking usually cannot get a reply
from the victim prefix. Accordingly, the authors correlate data-plane (un)reachability with
control-plane anomalies from a large number of public BGP route-servers and looking-glasses to
expose these network outages. Other similar works exist. For instance, Hu et al. [160] passively
collected BGP routing updates and information from the data plane: the basic idea is to use data
plane information in the form of edge network fingerprinting to disambiguate potentially numerous
suspect IP hijacking incidents based on routing anomaly detection.
4.4. Discussion
Internet outage detection systems that mainly rely on passive monitoring are highly efficient al-
ously injecting ICMP probes into the network towards a large number of representative
destinations, Trinocular [200] increases by almost 0.7% the Internet “background radiation”, i.e.,
the unsolicited traffic that all the networks in the Internet observe. This load imposed on the
network might easily be considered unacceptable. In addition, the necessary trade-off between the
number of targeted destinations and the sampling period causes outage detection systems relying
exclusively on active probing to likely report only large and long-lasting network outages. The
hybrid approaches, instead, seem to represent the best option since they combine the advantages of
passive monitoring and active probing. These systems primarily adopt passive monitoring to
continuously gather coarse-grained information on the network status. Opportunistic measurements
based on active probing
are triggered to gather additional information only when this lightweight process provides clues
of possible Internet outages.
5. Outage impact evaluation
Nonformalized           [62, 97, 78, 59, 17, 31, 49, 124]
Formal metrics   User   [91, 119, 87]
                        [51, 247, 154, 153, 44]
                        [34, 59, 79, 46]
Table 4: Approaches for outage impact evaluation.
Measuring the impact of a network outage is a key challenge when either analyzing specific
Internet outages or systematically detecting them. In Table 4 we group the literature on outage
impact evaluation according to the adopted approaches, discussed hereafter. Most of the scientific
works dissecting specific episodes provided only a qualitative evaluation of the impact of the
outage, or reported a quantification not based on shared or significant metrics. Other works,
instead, formally introduced metrics and approaches adopting either
quake, Cho et al. [62] cite NTT reports with numbers of damaged base stations, transmission lines,
and circuits for fixed line services, and restrictions to voice call acceptance nationwide.
Regarding the Internet, the impact of the earthquake is found in volumes of traffic seen by
Internet Service Providers located in the area of the quake [62, 97]. Evaluating traffic volumes
has several drawbacks acknowledged by the authors themselves: part of the volume drop was due to
scheduled power outages for restoration works; similarly, voluntary shutdowns of servers and
networks were performed by companies (and users) to reduce power consumption (we do not include
such intentional operations in the definition of Internet outages); the reduced usage due to
outages was also likely offset by the increased use of the Internet for searching information and
telecommuting. In the analyses, routing information is also considered and related to submarine
cable cuts, reporting the earthquake impact in terms of link-state neighbor events per unit time:
related but more complex and formal metrics are described in Section 5.3. We also refer to Section
3.1 for the analyses of the specific outage event.
Similar baseline counts relative to outage impact are reported by [78] in detecting network
effects of censorship in the Arab Spring, where the existence and extent of an outage is measured
by the number of network prefixes (continuous sets of IP addresses) or the number of individual IP
addresses affected, e.g., by BGP withdrawal or DDoS attacks (see Section 3.4 for the related
analysis methodology). In this case the ultimate effect on users was inferred as global
disconnection of the whole population of the affected countries, as the network prefixes amounted
to the total of addresses available to country ISPs. Similar considerations, with country-wide
affected population, are reached regarding submarine cable cuts [59, 17], especially for countries
with limited connection options with neighbours.
Finally, consequences of outages (especially those caused by censorship) have been qualitatively
analyzed by NGOs and not-for-profit institutions in terms of human rights violations (free speech,
accountability of governments, discrimination) [31,
The impact of outages is measured according to the troubles perceived by the end-users of the
network. For instance, in [91], an outage impact is evaluated with respect to the estimated
population served by the affected network routes, based on studies (such as [119]) that correlate
population density with Internet usage. The interesting aspect of such an approach is that it
leads to an impact evaluation closer to the user, since it tries to consider the affected
population rather than measurable network metrics. Similarly, in [87] Djatmiko et al. relied on
the number of unreachable BGP prefixes to estimate the number of affected clients, in order to
evaluate the severity of an outage. All these works implicitly assume that the final mission of
the Internet is to guarantee global connectivity to end-users. Accordingly, the larger the section
of population being disconnected, the larger the impact of the network outage. This approach is as
simple as it is limited: although mere connectivity is the necessary condition for an end-user,
the perceived network performance also matters.
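The user-centric reasoning above can be illustrated with a short sketch that weights unreachable prefixes by an estimate of the users they serve. The per-prefix user estimates, the function name `affected_user_share`, and the example prefixes (taken from the documentation address ranges) are hypothetical; none of the cited works is claimed to use this exact computation.

```python
import ipaddress


def affected_user_share(users_per_prefix, unreachable):
    """Return the fraction of estimated users behind unreachable prefixes.

    users_per_prefix: dict mapping a prefix string to an estimated number
        of users served by that prefix (a hypothetical input, e.g. derived
        from population density studies).
    unreachable: iterable of prefix strings observed as unreachable.
    A prefix counts as affected when it is contained in an unreachable one.
    """
    down = [ipaddress.ip_network(p) for p in unreachable]
    total = sum(users_per_prefix.values())
    affected = 0
    for prefix, users in users_per_prefix.items():
        net = ipaddress.ip_network(prefix)
        # Compare only networks of the same IP version: subnet_of()
        # raises TypeError when IPv4 and IPv6 networks are mixed.
        if any(net.version == d.version and net.subnet_of(d) for d in down):
            affected += users
    return affected / total if total else 0.0


# Hypothetical user estimates for three prefixes.
users = {
    "192.0.2.0/25": 300,
    "192.0.2.128/25": 100,
    "198.51.100.0/24": 600,
}
print(affected_user_share(users, ["192.0.2.0/24"]))
```

As noted in the text, such a score captures only disconnection: users who remain connected but experience degraded performance are not reflected in it.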
5.3. Network-centric impact evaluation traffic load of every AS. To estimate the traffic load
of an AS, they proposed a betweenness centrality
A few other works evaluated the impact of an metric computed over the AS-level topology graph
outage introducing network-related metrics. An ap- of the Internet extracted from BGP data. Between-
proach commonly adopted is comparing the net- ness centrality of a node is the number of short-
work status and its performance before, during, and est paths from all vertices to all others that pass
after the outage. In this way, researchers aim at through that node, thus measuring the centrality
identifying, modeling, and quantifying the pertur- of the node in the network. Betweenness central-
bation caused by the outage. As we discuss in the ity proved to be a more meaningful measure than
following, works available in literature may strongly connectivity for both the load and importance of
differ in how this goal is achieved in practice. a node [95]: betweenness centrality is more global
In I-seismograph, Li et al.[51] relied on BGP data to the network, whereas the latter refers to a local
to characterize the network status. They defined effect. The key idea in [154] is to use betweenness
the impact of an outage in the Internet as any de- centrality to measure how many AS paths traverse
viation from BGP normal profile. This deviation a certain AS, since this will be directly related to
can be described in terms of a magnitude and a the amount of traffic transferred by that AS. The
direction. The magnitude represents an absolute, authors proposed approaches to evaluate the ag-
quantitative evaluation of the outage intensity. The gregated time of route changes, the worst affected
direction provides deeper insight, since it indicates components, etc. Banerjee et al. [44] also discussed
which BGP attribute(s) deviates the most from nor- the limitations of using the connectivity to evalu-
malcy. I-seismograph exploits a vectorial approach ate the impact of an outage: the authors proposed
to evaluate magnitude and direction of the outage. a new metric called Region-Based Largest Compo-
To evaluate the impact of an outage, Wu et nent Size (RBLCS) related to the size of the largest
al. [247] defined two families of network-centric met- network connected component once all the nodes of
rics: (i) reachability impact metrics (RIMs) and (ii)
tion that the traffic traversing failed links is shifted second considers the terminal reliability measuring
to new paths after an outage, and this may cause the effect on the global connectivity of the network
For this reason, the authors introduced the con- degradation and it is based on the correlation be-
cept of “link degree”, that is, the number of short- tween loss and delay. In [79], the impact of an
est policy-compliant paths traversing a link. Based outage is measured by considering how many IP
on this concept, they introduced three TIMs: (1) addresses in the affected geographical area likely
an absolute TIM, which is the maximum increase lost connectivity. This analysis is based on passive
of link degree among all links; (2) a relative TIM, monitoring of data plane unsolicited traffic coming
which is the ratio between the absolute TIM and from the affected region. Note that this is a rela-
the link degree of the new path (after rerouting); tive measure, because it is compared to the number
and (3) an evenness TIM, which is the ratio be- of IP addresses that were visible before the outage.
tween the absolute TIM and the link degree of the Furthermore, this information can be used to infer
failed path. This metric captures the evenness of another measure, that is, the maximum radius of
re-distributed traffic for the failed link. the impact. This approach adopts both a user- and
Similarly, Liu et al. [154, 153] proposed an impact a network-centric point of view and it was further
evaluation methodology based on the changes of the explored in [46] where other IBR-derived metrics
are proposed to gain insight into macroscopic connectivity disruptions.

5.4. Discussion on outage impact evaluation

The vast majority of the outage-related studies only performed a qualitative or non-formalized
evaluation of the outage impact, providing a quick insight into the consequences of the outage. The
largely incomplete and scattered information provided by such studies contributes only poorly towards
an exhaustive and accurate understanding of the possible consequences of disruptive network events.
From this point of view, the few works proposing formally defined user- or network-centric metrics
appear to be extremely valuable. Indeed, the proposed metrics are unambiguous and can be adopted to
investigate and compare different network outage events. Unfortunately, we noticed the lack of a
widely accepted framework of formally defined metrics: each work proposed or adopted different
metrics, making the Internet outages documented in the literature very hard to compare. As we discuss
further in Sec. 9, this heterogeneity of metrics represents a key open issue.

[...]work operators and administrators to understand whether or not their infrastructure is enough
re-[...] from the normal or correct operational status of the network. One way to achieve fault
tolerance is the use of redundancy. We discuss redundancy and other outage countermeasures in
Section 8.

Reliability. Commonly used in the design, deployment, and maintenance of critical systems, the
reliability of a network describes the probability of not observing any failure in a certain time
span. Hence, reliability quantifies the continuity of proper service and it is sometimes implicitly
used in outage-related studies. For example, in [169, 34] the network resilience under geographically
correlated failures is evaluated by calculating the average two-terminal reliability. Using the
average two-terminal reliability is quite a common approach, especially in graph-theoretical works.
However, Segovia et al. [213] argued that, when considering connection-oriented networks, a
two-terminal reliability metric is not appropriate to assess the capability of the network to
guarantee connections.

Availability. Availability is a concept closely related to reliability, and in [25] is defined as
“the ability of a system to be in a state to perform a [...] availability, [35] argues that
“availability is reliability evaluated at an instant” (instead of over a time interval). Similar
metrics are adopted to measure availability and reliability.

Elasticity. Introduced by Sydney et al. [227], the elasticity of the network is formally defined as
the [...]

Survivability. In [224], survivability is defined as the capability of a system to fulfill its
mission, in a timely manner, in the presence of threats such as attacks or large-scale natural
disasters. Castet et al. [127] observed how resilience and survivability are used interchangeably
depending on the specific context of a given study. Usually, in network-oriented studies,
survivability is seen as a static component. For further details on survivability, the reader may
refer to [114] [139].

6.1. Definitions of Resilience

The term resilience has its origin in the Latin word resilire, which means to bounce back. Some
studies tried to address the lack of standardization and rigour when quantitatively defining
resilience in general by reviewing the different existing definitions, concepts, and approaches. For
instance, Henry et al. [201] showed how the general concept of resilience has been variously
interpreted in different contexts. In physics, for instance, resilience is the ability of a material
to resume its natural shape after being subjected to a force. In sociology, it is the ability of a
social system to respond and recover from disasters [126, 207].

Regarding communication networks, Sterbenz et al. [224] defined resilience as the ability of the
network to provide and maintain an acceptable level of service in the face of various faults and
challenges to normal operation [224]. Whitson et al. [202] provided two complementary definitions:
(i) a static resilience, which is “related to the ability of an entity or system to maintain function
when shocked”; and (ii) a dynamic resilience, which is “related to the speed at which an entity or
system recovers from [...]ery. Finally, Cholda et al. [65] introduced the concept of “Quality of
Resilience” (QoR), to correlate resilience with the methodologies adopted to esti-[...]

6.2. Evaluating network resilience

Network resilience has very often been evaluated only qualitatively, i.e. no formal metric was
introduced or adopted. This is a common limitation of outage-related studies, where qualitative
evaluations are much more common than quantitative, structured evaluations. For instance, Cho et
al. [62] argued that, during the Japanese earthquake, “despite many failures, the Internet was
impressively resilient to the disaster”. Interestingly, Sterbenz et al. [224] argued that “it is
widely recognised that the Internet is not sufficiently resilient, survivable, and dependable, and
that significant research, development, and engineering is necessary to improve the situation”. Being
able to evaluate how resilient the network is against disruptions is an essential next step toward a
more comprehensive understanding of network outages and their mitigation.

For this reason, other works formally introduced metrics and approaches to quantify the network
resilience.

6.2.1. Works introducing formal metrics for re-[...]

[...] such as the average shortest path length, the largest component size, and the number of
connected node pairs in the network. The authors also consider the routing policy-compliant directed
graph modelling the analyzed network. Whitson et al. [202] proposed a probabilistic model for
evaluating the disruption caused by an outage. The authors applied this approach to GMPLS-based
transport networks. In [201], Henry et al. proposed a resilience metric as a time-dependent function
describing the ratio of recovery at a given time from an outage suffered by the system at a certain
point in the past. In [172], a fuzzy architecture assessment for critical infrastructure resilience
is presented. Finally, Gorman et al. [106] proposed distance-based approaches for identifying
critical nodes and links in communication networks and evaluated them
through simulations. In this pioneering work, the authors demonstrated the importance of the
structural properties of small-world and scale-free networks. They also preliminarily explored a
method for analyzing the interactions of physical and logical networks, demonstrating how these
depend on each other at both a micro and a macro level: although the database of national data
carriers adopted in this paper is now largely outdated, this conclusion still appears valid today.

6.2.2. Works focussing on resilience-related metrics

Other works faced the problem of quantifying the resilience of a network relying on operations
research and graph theory techniques. They also commonly refer to the vulnerability of the network
instead of its resilience. To the best of our knowledge, no formal and shared definition of
vulnerability exists in the literature in the context of outages (and, most confusingly, in the
context of security it is defined as “an internal fault that enables an external fault to harm the
system” [41], i.e. as a sub-type of threat, not as a system attribute). Difficulties and [...] i.e. a
minimally vulnerable network is a network with maximum resilience, and vice versa. We discuss some of
these works in the following.

Usually, works focussing on the evaluation of network vulnerability (i) propose a graph
representation of the network, (ii) define an outage model, and (iii) [...] destroyed by the outage
and it is removed from the graph.

For example, Li et al. [152] assessed the sur-[...] find (i) the vulnerable points within the network
in case of single and multiple disasters; and (ii) the points responsible for the most significant
destruction. Interestingly, they also proposed a simple probabilistic model in which the probability
of each network element failure is given. This is due to the fact that a network element does not
necessarily fail, even when it is close to the outage epicentre. They proposed two metrics: (i) the
number of link failures caused by the outage; (ii) the two-terminal [...]

Wang et al. [244] also proposed a probabilistic outage model and defined metrics and methods to
assess network vulnerability. They argued that “neglecting probabilistic behavior of a region failure
may significantly over-estimate or under-estimate its impact on network reliability”. The authors
proposed three metrics to assess the network vulnerability: (i) the remaining link capacity, i.e. the
expected capacity of all remaining survived links; (ii) the pairwise capacity reduction, i.e. the
expected decrease in traffic between a pair of given nodes; and (iii) the pairwise connecting
probability, i.e. the probability that a pair of given nodes with path [...]formance metrics based on
geometric probability. They evaluate network robustness by calculating the average two-terminal
reliability of the network nodes. In [175], Neumayer et al. assessed the vulnerability of the fiber
infrastructures to disasters, exploiting a graph model in which nodes and links [...] 175, 44, 83,
243, 170, 90]. These are general studies not strictly focused on IP networks, although [...]
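
The general recipe followed by vulnerability studies in this subsection (a graph representation of
the network, an outage model, and the removal of the destroyed elements from the graph) can be
sketched as follows. This is only a minimal illustration and does not reproduce the method of any of
the cited works: the toy topology, the node coordinates, the deterministic circular outage region,
and all function names are assumptions made for the example. The sketch reports two of the simple
graph metrics mentioned in this section, namely the largest component size and the number of
connected node pairs.

```python
from collections import deque
from math import hypot


def surviving_graph(coords, links, epicentre, radius):
    """Drop every node within `radius` of `epicentre`, together with its links."""
    alive = {n for n, (x, y) in coords.items()
             if hypot(x - epicentre[0], y - epicentre[1]) > radius}
    adj = {n: set() for n in alive}
    for u, v in links:
        if u in alive and v in alive:
            adj[u].add(v)
            adj[v].add(u)
    return adj


def components(adj):
    """Connected components of an undirected graph, found by BFS."""
    seen, comps = set(), []
    for start in adj:
        if start in seen:
            continue
        comp, queue = {start}, deque([start])
        seen.add(start)
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    comp.add(v)
                    queue.append(v)
        comps.append(comp)
    return comps


def impact(coords, links, epicentre, radius):
    """Return (largest component size, number of connected node pairs)."""
    comps = components(surviving_graph(coords, links, epicentre, radius))
    largest = max((len(c) for c in comps), default=0)
    pairs = sum(len(c) * (len(c) - 1) // 2 for c in comps)
    return largest, pairs


# Hypothetical five-node topology laid out on a plane.
coords = {"A": (0, 0), "B": (1, 0), "C": (2, 0), "D": (3, 0), "E": (1, 1)}
links = [("A", "B"), ("B", "C"), ("C", "D"), ("B", "E")]

print(impact(coords, links, epicentre=(2, 0), radius=0.5))  # (3, 3)
```

In this toy case the outage destroys only node C, which isolates D and leaves A, B, and E connected.
A probabilistic outage model, such as those discussed above, would replace the hard distance
threshold with a per-element failure probability.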
6.2.3. Resilience of data-center networks

Finally, a topic that has recently gained attention is the resilience of data-center networks. This
is mainly due to the strict Service Level Agreements (SLAs) they have to meet, and thus the involved
economic aspects. Most of the current literature in this field has mainly focused on single failures
of links, computational elements, network devices, etc. inside the data center network (e.g., [104,
174]). Quantifying the data center network resilience to outages occurring in the public Internet has
received little attention.

The evaluation of resilience for data center systems is addressed in [136], in which Khalil et al.
proposed a general resilience metric framework, based on monitoring efficiency features which would
be impacted by an outage.

In [101], Ghosh et al. attempted to quantify the resilience of IaaS cloud [13]. Their definition of
resilience includes the notion of change. They consider two types of changes: (i) changes in client
demand (e.g., job arrival rate), and (ii) changes in system capacity (e.g., the number of available
physical machines). Ghosh et al. proposed to quantify cloud resilience in terms of the effect of
changes on two performance-based quality-of-service (QoS) metrics: the job rejection rate and the
provisioning response delay. Their analysis is based on a [...] Service Providers (CSPs) exploit
geo-distributed networks of data-centers. Geo-diversity can enhance performance (e.g., service
delays) and increase reliability in the presence of an outage im-[...]

Risk assessment is “the determination of quantitative or qualitative value of risk related to a
concrete situation and a recognized threat” [129]. A risk is made up of two essential components: (i)
the magnitude of the potential loss, and (ii) the probability that the loss will occur. In RFC 4949
[216] the concept of acceptable risk is defined. Specifically, a risk is acceptable if it is
understood and tolerated, usually because the cost or difficulty of implementing an effective
countermeasure for the associated vulnerability exceeds the expectation of loss. Risk assessment
methodologies are commonly employed in safety engineering and reliability engineering studies. On the
other hand, this concept has not been deeply examined in computer networks engineering so far.

We believe that further investigations should be carried out. There are two main reasons for this.
Firstly, the Internet has become a critical infrastructure, which is something we have already
discussed in Sec. 1. Many risk assessment studies have been performed on other critical
infrastructures (e.g., electric power systems). The same should be done for the Internet. Secondly,
risk assessment methodologies can be even more useful in smaller, private networks. Suppose in fact
that a company wants to insure its network infrastructure: this cannot be done if insurance companies
do not know how to perform a risk assessment study on a computer network.

7.1. Risk Assessment in IP networks

The problem of assessing the outage-related risk in IP networks is not commonly addressed in the
literature. In this paragraph, we aim to provide a brief presentation of studies that somehow try to
address this problem, so as to understand the current state of the art.

An important contribution on risk assessment in [...]cations, and designing risk-aware networks. The
authors propose a three-dimensional scheme to describe the complexity of Internet networking for risk
framing, with the dimensions being [...]

[...] layering, but for the presence of an Overlay layer between Network and Application (service);
• market elements: related to resources shared within the system of different network providers,
e.g. physical infrastructure, equipment, peering, outsourcing, end-user services.

Moreover, they compare communication networks with networking systems such as aviation and railways,
highlighting analogies and differences with
impact on risk analysis. The basis for risk analysis is then founded on metrics for network
reliability and availability of repairable systems. These are linked with network failure models and
loss estimation models. Finally, drawing on theoretical models from finance, the risk is expressed in
terms of business consequences for different actors, namely providers, users, regulators and
researchers. This way the authors of [66] are able to propose a method to map events affecting
network functionality onto a quantity expressing the economic risk of the network operator.

In [233], Vajanapoom et al. formulated three risk management techniques for the design of a resilient
network, based on (i) the minimization of the maximum damage that could occur in the network, (ii)
the minimization of the maximum risk in the network, and (iii) the minimization of the root mean
square damage. The paper proposes a risk assessment methodology that is functional to the
aforementioned risk minimization techniques.

Furthermore, Vajanapoom et al. [233] [234] also adapted the risk concept to networked environments.
Specifically, they use the concept of network state as a tuple in which the i-th element specifies
whether the i-th network component is in a failure state or not. Thus, for n network components,
there are a total of 2^n possible network states (i.e., failure scenarios). Then, the risk associated
with a network state s is equal to the product of the probability of the network being in state s and
the amount of damage occurring in network state s. Since all network states are mutually exclusive,
the overall network risk is equal to the sum of the risks associated with each network state over all
states. Two pieces of information are thus needed: (i) the probability of a state and (ii) the amount
of damage that corresponds to that state. According to the authors, the probability of a state can be
determined by suitably multiplying the appropriate failure probabilities of all network components.

Damage can be measured in different ways. In connection-oriented networks, such as WDM and MPLS
networks, they argue that it is natural to consider the amount of damage associated with the loss of
each end-to-end connection due to network failures. Therefore, the amount of damage that corresponds
to a certain network state s is the sum of the amounts of damage of all failed connections in that
network state. They also claim that, if information on the traffic is available, a damage metric
associated with each end-to-end connection that incorporates the societal or monetary effects of the
loss can be determined.

In [218], Silva et al. proposed an architecture for risk analysis in cloud environments.
Specifically, they propose a model in which the cloud consumer (CC) can perform risk analysis on a
cloud service provider (CSP) before and after contracting the service. The proposed model establishes
the responsibilities of three actors: the CC, the CSP, and Information Security Labs (ISLs). This
third actor is an agent that represents a public or private entity specialized in information
security (e.g., an academic or private laboratory). The authors claim that inclusion of this actor
makes the risk analysis more credible to the CC.

In this architecture, five risk analysis variables are proposed: (i) the Degree of Exposure (DE),
which defines how the cloud environment is exposed to a certain external or internal threat; (ii) the
Degree of Disability (DD), which defines the extent to which the cloud environment is vulnerable to a
particular security requirement; (iii) the Probability (P), which defines the probability of an
incident occurrence (e.g., a threat exploiting a vulnerability); (iv) the Impact (I), which defines
the potential loss in the event of a security incident; and (v) the Degree of Risk (DR), which
defines the degree of risk for a given scenario of a security incident. Their risk analysis works in
two well-defined phases: (i) the risk specification and (ii) the risk assessment. The former defines
and quantifies threats, vulnerabilities, and information assets that will compose the risk analysis,
whereas the latter consists in the quantification of the aforementioned five variables. The
architecture provides a language for the specification of risk, the Risk Definition Language (RDL),
specified in XML. This language is used by ISL to specify threats and vulnerabilities, and contains
information such as the risk ID, the ISL ID, the threat/vulnerability ID, and reference to a Web
Service Risk Analyzer (WSRA), which is a web service specified by ISL to perform the quantification
of the DD and the DE. Several other studies deal with risk analysis in cloud systems as well; the
reader may refer to [171, 242, 98, 238].

A risk assessment model for Optical Backbone Networks is proposed in [84], where the authors develop
a probabilistic model to analyze the penalty from service interruption due to a disaster. The outage
causes considered in this work are hurricanes, earthquakes and Weapons of Mass Destruction (WMD),
covering the categories both natural
and human-originated, but limited to physical damages, of type unintentional (the supposed targets of
WMD are based on city population and city importance, not intentionally aiming at the destruction of
the network). Failures refer to link failures, and in case of network device failure this is
translated to failure of the links connected to the device. The risk model is used to inform a
preventive deployment of backup links, so that the overall cost of redundant infrastructure and
potential losses/penalties is minimized, leading to a formulation equivalent to the Integer Linear
Programming arc-flow multi-commodity problem. The risk model is also used to deal with Correlated
Cascading Failures (CCFs), where the probability of further failures depends on the first one, to
propose a reactive Traffic Engineering solution aimed at mitigating and recovering from the disaster.

7.2. Risk Assessment and Resilience

The concepts of resiliency and risk assessment are related to each other. In [102], a risk assessment
methodology for networked critical infrastructure is presented. This methodology is made of two
prin-[...] the links between resilience and risk assessment can be found in the “Resilience and Risk
assessment for Critical Infrastructures” workshop [94]. In this, it is said that “The concept of
resilience can be seen as a superset in which typical risk assessment is a complementary part.”

7.3. Discussion

So far, we have discussed the notion of “risk assessment”. We motivated the need for further
investigations in this field and presented the current state-of-the-art. Only a few risk assessment
studies have been made in the context of computer networks. On the other hand, these studies are
quite common in other engineering fields and are usually focused on a specific outage, e.g.
earthquakes. In fact, different outages will cause the network components to react in different ways.
For example, in case of earthquakes, we are interested in knowing the response to shocks and
vibrations of every network component, so as to suitably combine this information with the earthquake
propagation data to understand what happens to each component and to the whole network. In case of
hurricanes, different figures of mechanical reliability will be required. In case of
logically-disruptive outages, such physical-level figures would be useless, and different models need
to be developed.

Furthermore, when assessing the risk, it should also be necessary to take into account the
interdependent response of correlated systems, e.g. telecommunication and electric power systems. An
example is provided in [147], in which the focus is on seismic hazards. Models that consider multiple
interdependent networks have been introduced in the literature and can be used to perform risk
assessment: we refer the reader to [241] and works cited therein.

[...] (fast) recovery mechanisms, to bring the system back to a fully operational state. According to
the literature, recovery mechanisms are necessary to ensure network resilience. As we discuss in the
following, recovery mechanisms can be either reactive or proactive; they can also be progressive,
especially in case of physical disruption of network components. We provide a general overview of
recovery mechanisms in the following, referring the readers to [172] for more details about metrics
and taxonomies on recovery mechanisms.

Fig. 7 reports a simple “outage model” as introduced by Henry et al. in [201]: a generic system
affected by an outage is modelled as a Finite [...]
in [118], network recovery mechanisms can be categorized into reactive and proactive mechanisms. In
reactive recovery mechanisms, after a failure is detected, network nodes re-run the routing algorithm
and exchange information for the routing to converge. This can take quite a long time, especially in
BGP, where convergence times are generally long. On the other hand, in proactive recovery mechanisms,
(i) some network failures are assumed, and (ii) corresponding recovery settings are pre-calculated
and distributed among network elements, so that, in case one of those failures is detected, the
recovery mechanism immediately selects one of the pre-calculated settings (the one that corresponds
to the detected failure). This mechanism aims at reducing the convergence time required by routing
protocols. However, in [118] Horie et al. claimed that ‘when the failure has not been considered in
the pre-calculation, the recovery mechanism cannot completely recover from the failure.’ Hence,
real-time outage detection techniques identifying possible network failures are essential for such
mechanisms to properly operate. Sometimes a dif-[...] be only progressively restored over time. The
authors noticed how different recovery processes will result in different amounts of network capacity
increase after each stage due to the limited available repair resources. In [112], Hansen et al.
proposed a differentiation based on the scope of the recovery: they defined global recovery
mechanisms covering link and node failures by calculating a new end-to-end path, and local recovery
mechanisms, in which failures are handled locally by the neighbor nodes. In [201], the recovery
action consists of two key aspects: (i) a component recovery mechanism, which describes policies for
restoring or repairing a disrupted component, and (ii) an overall resilience strategy, which is
related to implementing component recovery mechanisms at the system level.

Figure 7: An outage model, as introduced by Henry and Ramirez-Marquez in [201].

An approach encompassing both proactive and reactive aspects is the autonomic one, characterized by a
control loop including automatic monitoring, analysis, planning, and execution phases [135]. Such an
approach has been proposed e.g. for threat mitigation in the Internet of Things [38].

8.2. Outage Solutions

In this section, we discuss the solutions proposed [...]

In this paragraph, we discuss solutions that have been proposed in the literature to deal with
outages by mainly operating at the physical layer. These solutions often outline design choices such
as redundancy and how to exploit it in an effective manner.

At this level, recovery requires the detection of the damaged components and their replacement or
fixing. It is important to define effective strategies to restore the service as soon as possible.
For this reason, we will focus on proactive recovery mechanisms (or protection schemes).
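
The proactive (protection) idea described above, in which recovery settings are pre-calculated for a
set of assumed failures and simply looked up when one of them is detected, can be sketched as
follows. This is a minimal illustration under our own assumptions, not the scheme of any cited work:
the four-node topology, the restriction to single-link failures on the primary path, the use of
hop-count BFS, and the function names are all hypothetical.

```python
from collections import deque


def shortest_path(adj, src, dst, banned=frozenset()):
    """BFS shortest path from src to dst that avoids the undirected links in `banned`."""
    prev, queue = {src: None}, deque([src])
    while queue:
        u = queue.popleft()
        if u == dst:
            path = []
            while u is not None:
                path.append(u)
                u = prev[u]
            return path[::-1]
        for v in sorted(adj[u]):
            if v not in prev and frozenset((u, v)) not in banned:
                prev[v] = u
                queue.append(v)
    return None


def precompute_backups(adj, src, dst):
    """Pre-calculate one backup path for each single-link failure on the primary path."""
    primary = shortest_path(adj, src, dst)
    table = {}
    if primary is None:
        return None, table
    for u, v in zip(primary, primary[1:]):
        link = frozenset((u, v))
        table[link] = shortest_path(adj, src, dst, banned={link})
    return primary, table


def recover(table, failed_link):
    """Select the pre-calculated setting for the detected failure (None if not foreseen)."""
    return table.get(frozenset(failed_link))


# Hypothetical topology: two disjoint two-hop paths between A and D.
adj = {"A": {"B", "C"}, "B": {"A", "D"}, "C": {"A", "D"}, "D": {"B", "C"}}
primary, table = precompute_backups(adj, "A", "D")

print(primary)                      # ['A', 'B', 'D']
print(recover(table, ("B", "A")))   # ['A', 'C', 'D']
print(recover(table, ("C", "D")))   # None
```

The last lookup returns nothing because a failure of link C-D was not among the assumed failures,
which mirrors the limitation pointed out by Horie et al. [118]: a failure that was not considered in
the pre-calculation cannot be recovered by the table alone.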
Redundancy. Redundancy is a key design feature for fault-tolerant systems. Without a certain degree
of (physical and/or logical) redundancy, resiliency can never be achieved.

In [97], Fukada et al. analyzed the impact of the Japanese earthquake on an important national
network. They observed that, even though some physical links were damaged, the network connectivity
was maintained thanks to two levels of network redundancy: (i) a physical link level redundancy, and
(ii) a network topology level redundancy. The former is guaranteed by dual physical links that route
along different geographical paths, whereas the latter is provided by redundant multiple loops in the
network topology. Cho et al. [62] analyzed the same outage on a different network. They emphasized
the importance of redundancy and over-provisioning in the network design as well.

Nonetheless, several works pinpoint limitations of redundancy. In [213], Segovia et al. argued that
usually redundancy-based techniques are effective under single-failure scenarios rather than for
outages, since the cost of implementing massive redundancy for rarely occurring events is
prohibitive. [...] backup link from coming up. They argue that these failures are non-transient in
nature and can only be resolved by the intervention of a network operator. In [247], Wu et al.
discovered that, in spite of the apparent physical redundancy, a large number of ASes are vulnerable
to a single access link failure; furthermore, BGP policies further severely limit the network
resilience under failure. They find that about 35% of the ASes can be disconnected from the rest of
the network by a single link failure, which they claim to be the most common failure in today’s
Internet. An outage can thus simultaneously disrupt a large number of stub ASes.

[...] if a path is disrupted and no other path is available, the only possible solution consists in
repairing the disrupted path. If however another path is available, further techniques must be
employed to quickly and effectively recover from the outage, because (i) the idea of realizing fully
redundant network systems is not applicable (for economic reasons), and because (ii) redundancy is
weak against correlated failures that an outage would be likely to cause.

Link Prioritization. Segovia et al. [213] proposed a protection scheme based on link failures. A
network node can fail partially or totally, according to the status of its links. They supposed that
all the links in a network have equal probability of being hit by a certain failure, and that it is
possible to make them invulnerable at a fixed cost per link. In case of outages, several links will
be affected at once; assuming that only a fixed budget is available for shielding links, only a
limited number of them can be made invulnerable. Accordingly, the authors proposed an optimization
model to decide which links should be part of the set of invulnerable links. Obviously,
“invulnerability” is only an [...]torial and non-deterministic nature of the problem requires the
introduction of approximate solutions. For this reason, in [213] Segovia et al. proposed two
heuristic-based approaches to the problem, whose common purpose is to produce a prioritized list of
links to make “invulnerable”, from which to choose according to the available budget.

Progressive Recovery Mechanisms. The progressive recovery mechanism proposed by Wang et al. in [240]
is somewhat complementary to link prioritization. The authors proposed an optimal recovery mechanism
to progressively restore the network capacity under fixed budget constraints. This is [...]
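
The budget-constrained link shielding discussed under Link Prioritization (produce a prioritized list
of links, then shield as many as the budget allows at a fixed cost per link) can be sketched as
follows. The ranking criterion used here, i.e. the number of node pairs disconnected when a link
fails on its own, is purely an illustrative assumption and is not one of the two heuristics proposed
in [213]; the topology, costs, and function names are likewise hypothetical.

```python
from collections import deque


def connected_pairs(nodes, links):
    """Number of node pairs that can reach each other over the given undirected links."""
    adj = {n: set() for n in nodes}
    for u, v in links:
        adj[u].add(v)
        adj[v].add(u)
    seen, pairs = set(), 0
    for start in nodes:
        if start in seen:
            continue
        size, queue = 0, deque([start])
        seen.add(start)
        while queue:
            u = queue.popleft()
            size += 1
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    queue.append(v)
        pairs += size * (size - 1) // 2
    return pairs


def prioritize(nodes, links):
    """Rank links by the connectivity lost when each one fails alone (most critical first)."""
    base = connected_pairs(nodes, links)
    scored = []
    for link in links:
        rest = [other for other in links if other != link]
        scored.append((base - connected_pairs(nodes, rest), link))
    scored.sort(key=lambda item: -item[0])  # stable sort: ties keep their input order
    return [link for _, link in scored]


def shield(nodes, links, budget, cost_per_link):
    """Pick the links to make 'invulnerable' under a fixed budget and a fixed cost per link."""
    affordable = int(budget // cost_per_link)
    return prioritize(nodes, links)[:affordable]


# Hypothetical topology: a triangle A-B-C plus a stub node D attached through C.
nodes = ["A", "B", "C", "D"]
links = [("A", "B"), ("B", "C"), ("C", "A"), ("C", "D")]

print(shield(nodes, links, budget=10, cost_per_link=10))  # [('C', 'D')]
```

With a budget that covers a single link, the sketch shields the access link of the stub node, since
it is the only link whose individual failure disconnects any node pair.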
In [243], Wang proposed two optimization problems: the first problem considers effective connection
recovery when a disruptive event happens, whereas the second one studies network augmentation to
build a resilient network against any single region failure. Wang showed that these problems are also
NP-hard and require heuristic algorithms to be solved.

Risk-based Resilient Network Design. In [234], Vajanapoom et al. proposed a risk-based resilient
network design. The main design problem taken into account is: given a working network and a fixed
budget, how to best allocate the budget for deploying a survivability technique in different parts of
the network based on the risk management. The authors proposed four risk-management-based approaches
for survivable network design: (i) minimum risk; (ii) minimum-maximum damage; (iii) minimum-maximum
risk; and (iv) minimum-RMS damage survivable network.

8.2.2. Solutions at the Data-Link Layer

In this paragraph, we present solutions that mainly operate at the data-link layer. We focus on the
adaptation of SONET/SDH-like resilience techniques to IP networks. These solutions can be set in a
reactive or proactive manner.

In [226], Suwala and Swallow argue that IP traffic can be protected using techniques below layer 3
[...] bundling (IPIB), and MPLS fast reroute (MPLS-FRR) are covered. The key motivation is that layer
3 solutions are limited by the need to communicate among multiple routers, whereas the aforementioned
techniques are not subject to this constraint.

APS and RPRP are mechanisms that aim at exploiting redundant paths in a fast and efficient way. These
are based on the existence of protection links.

IP interface bundles are used to group several physical links into a single virtual link, i.e. a
logical link. If one or more physical links fail, traffic can be quickly shifted to other links in
the bundle. This mechanism is transparent to the routing protocol. Nevertheless, disruptive outages
are likely to cause the failure of all the links in the bundle.

MPLS-FRR aims at repairing damaged tunnels by creating “bypass tunnels” that replace the failed
links. Bypass tunnels simply represent other MPLS TE tunnels; these can be set in a reactive or
proactive manner, and usually work by adding one or more labels to the packets traveling on a primary
tunnel, in order to divert traffic onto the bypass tunnel. However, as observed in [245] [24], while
MPLS-FRR is the major technique currently deployed to handle network failures, practical limitations
still exist, in terms of complexity, congestion, and performance predictability.

8.2.3. Solutions at the Network/Application Layers

In the following paragraphs, we present solutions that directly operate at the network layer or at
the application layer, typically making use of overlay networks. Some solutions operate at both
layers.

Resilient Routing Reconfiguration (R3). At the end of the previous paragraph we have described
challenges that still have to be addressed in MPLS-FRR. In [245], Wang et al. argued that, in early
2010, two of the largest ISPs in the world gave instances of severe congestion caused by FRR in their
networks. Motivated by the aforementioned limitations, Wang et al. [245] proposed Resilient Routing
Reconfiguration (R3), a novel routing protection scheme. R3 can quickly mitigate failures by
pre-computing forwarding table updates for the loss of each link. They argue that R3 is (i)
congestion-free under a wide range of failure scenarios; (ii) efficient w.r.t. router processing
overhead and mem-[...]proach: R3 strongly depends on a novel technique for covering all possible
failure scenarios with a compact set of linear constraints on the amounts of traffic that should be
rerouted. The authors formulate a linear programming model to characterize optimal rerouting, and
implement R3 protection using MPLS-ff, a simple extension of MPLS, while the base routing can use
either OSPF or MPLS. The authors implemented R3 on Linux and claim that their Emulab evaluations and
simulations based on real Internet topologies and traffic traces show that R3 achieves near-optimal
performance.

BGP Modifications. Several studies proposed modifications to BGP, based on the observation that it is
the de-facto standard for inter-domain routing in the Internet. In the previous sections, we only
considered BGP as an analysis tool. In this paragraph we report studies for which BGP is
the object of the analysis instead. The perspective is this: after the occurrence of an outage,
BGP is responsible for the recovery of interdomain connectivity (exploiting a reactive recovery
mechanism); is it effective enough? Is it fast enough? A large number of studies in the literature
cover the problem of BGP convergence times after the occurrence of a (large) failure; for example,
see [208, 210, 209], [151, 109, 50, 188].

Another issue often raised in literature is related to the fact that BGP is still based on ‘the
honor system’ [70], that is, any organization on the Internet can easily assert that it owns the
IP addresses of any other organization, and it is up to the receivers of these BGP updates to
decide whether to trust the information or not. Hence, a third question on BGP can be considered:
is it secure enough? The answer is clearly no. Several BGP security vulnerabilities are presented
in RFC 4272 [173], and although there are proposals for secure BGP versions (e.g., BGPSec [149]),
it is easy to understand the difficulties that arise when trying to actually introduce them in the
Internet. A survey on securing BGP is provided in [123]. Furthermore, illicit prefixes could be
imported/exported also as a consequence of misconfigurations rather than intentional attacks. For
example, Mahajan et al. [158] showed that misconfiguration errors are pervasive, with 200-1200
prefixes suffering from misconfiguration each day.

In conclusion, BGP still has some problems. It may be slow to converge and it is not secure at
all. Modifications to BGP have often been proposed in literature, but the reader will easily
understand the difficulty of modifying the way routers work in the whole Internet.

LIFEGUARD. In [134], Katz-Bassett et al. proposed LIFEGUARD, a system for automatic failure
localization and remediation. It uses active measurements and a historical path atlas to locate
faults. The authors propose an approach to quickly bypass disrupted areas. The key idea is to give
data-centers and other well-provisioned edge networks the ability to repair persistent routing
problems, regardless of which network along the path is responsible for the outage. If some
alternative, working policy-compliant path can deliver traffic during an outage, the data center
or edge network should be able to cause the Internet to use it. The interesting characteristic of
LIFEGUARD is that it is actually deployable on today’s Internet. The authors argue that existing
approaches often allow an AS to avoid problems on its forward paths to destinations, but little
control over the paths back to the AS is provided. LIFEGUARD provides reverse path control through
BGP poisoning. Specifically, the origin AS inserts the (partially) disrupted AS into its path
advertisements. This way, it appears that the disrupted AS has already been visited. When the
announcements reach the disrupted AS, BGP’s loop-prevention mechanism will drop the announcement.
Hence, networks that would have routed through the disrupted AS will only learn of other paths.
The authors show that LIFEGUARD’s rerouting technique finds alternate paths 76% of the time. This
mechanism seems to provide two main advantages: (i) the networks that use LIFEGUARD are less
affected by outages in the rest of the Internet, and (ii) the amount of traffic in the disrupted
area decreases, so that the “survived” network capacity can be better exploited by people in the
affected area.

RiskRoute. In [91], Eriksson et al. proposed RiskRoute, a routing framework for mitigating network
outage threats. The authors introduce the concept of bit-risk miles, the outage-risk-weighted
distance of network routes. This measure is proposed with respect to four properties: (i)
geographic distance; (ii) outage impact; (iii) historical outage risk; (iv) immediate/forecast
outage risk. RiskRoute is a routing framework based on the definition and opportune use of
bit-risk miles. Specifically, the objective is the minimization of the bit-risk miles of routes in
a network infrastructure. RiskRoute can be used to reveal the best locations for provisioning
additional PoP-to-PoP links, or new AS peering connections that would be advisable to establish,
etc.

Eriksson et al. assessed and evaluated RiskRoute, determining the providers that have the highest
risk to disaster-based outage events. They are also able to provide provisioning recommendations
for network operators that can in some cases significantly lower bit-risk miles for their
infrastructures. Under this perspective, it can be used as a high-level resilient routing
framework. Further considerations on resilient routing frameworks can be found in [189], in which
Pei et al. provided a survey on research efforts in the direction of enhancing the dependability
of the routing infrastructure.

Geographically Informed Inter-Domain Routing (GIRO). In [179], Oliveira et al. proposed a new
routing protocol and address scheme, called GIRO (“Geographically Informed
Inter-Domain ROuting”). GIRO uses geographic information to assist (and not replace) the
provider-based IP address allocation and policy-based routing. The authors argue that, by
incorporating geographic information into the IP address structure, GIRO can significantly improve
the scalability and performance of the global Internet routing system. Within the routing policy
constraints, geographic information enables the selection of shortest available routing paths. The
authors argue that traversing longer distances (and more routing devices) is likely to increase
the chance of outage, as well as to degrade other performance metrics. For this reason, we
presented GIRO in this context, even though its main focus is not on proposing an outage-aware
routing scheme.

Resilient Routing Layers (RRL). Resilient Routing Layers (RRL) were first introduced in [112] by
Hansen et al. Given a network topology, RRL assumes a certain number of failures in the network
node(s). Each failure scenario is associated with a different topology, that can be derived from
the original one. On this, RRL pre-calculates an opportune routing table. These are called Routing
Layers (RL). Each RL attempts to configure the network topology so as to re-route traffic while
preserving the reachability of other parts of the network. All nodes in the network share the
calculated RLs, and select the same single RL when a network failure occurs. RRLs represent an
example of proactive recovery mechanism, because “backup routes” are pre-calculated. In [111],
Hansen et al. demonstrated how their RRL method can be used as a tool for recovery from outages.

RRL with Overlay. An adaptation of RRLs to accommodate large-scale failures is also provided, for
example, in [118] by Horie et al., who proposed an overlay network approach. In fact, they argue
that using an overlay network is convenient for different reasons: first of all, methods based on
overlay networks can be easily and quickly deployed, since no standardization process is needed.
Furthermore, they claim that the application-level traffic routing performed by overlay routing
can overcome the shortcomings in policy-based BGP routing.

Underlays Fused with Overlays (UFO). In [250], Zhu et al. proposed UFO, a Resilient Layered
Routing architecture. The authors argue that common routing protocols and overlay networks have
their pros and cons. Therefore, they propose an architecture that tries to achieve the best of
both worlds. In fact, they argue that common routing protocols scale relatively well, but do not
react quickly to changing network conditions. On the contrary, overlay networks can quickly
respond to changing network conditions, but they do not scale very well, since they rely on
aggressive probing. UFO stands for Underlays Fused with Overlays to stress its two-layered
architecture. It provides the abstraction of a subscription service for network events occurring
along the underlying paths between the overlay nodes. Explicit cross-layer notifications help
improve the efficiency of reactive routing at the overlay layer without compromising scalability,
since notification messages are propagated only to the participating overlay nodes.

Further studies on RRLs have been proposed; the reader may also refer to [142, 143]. In this
paragraph we presented the concept of RRL. We have presented a ‘native RRL’ solution [111], an
‘overlay-based’ solution [118], and a compromise between the two [250]. Pros and cons of the
different approaches were presented.

Network Resilience through Multi-Topology Routing. In [167], Menth and Martin propose the use of
Multi-Topology Routing (MTR) to achieve higher network resiliency. MTR is an optional mechanism
within IS-IS, used today by many ISPs. It provides several different IP routing schemes within one
network. In [167], Menth and Martin enhance MT routing to provide resiliency. The key idea is
simple: under normal network conditions, a basic MTR scheme is used. If a node detects the outage
of one of its adjacent links or neighbor nodes, it deviates all traffic that would be sent over
this failed element according to the routing table to another interface, over an alternative route
provided by another MTR scheme. Under this perspective, MTR is somehow similar to RRL. The authors
argue that their solution can guarantee a failover time comparable with MPLS solutions based on
explicit routing.

Resilient Overlay Networks (RON). Resilient Overlay Networks (RON) were first proposed in [37].
They rely on the advantages of overlay networks discussed in the previous paragraphs. A RON allows
Internet applications to detect and recover from path outages within several seconds, whereas
current wide-area routing protocols can require several minutes. RONs have a relatively simple
conceptual design: several RON nodes are deployed at various locations on the Internet and form an
application-layer overlay that cooperates in routing packets. Since a RON provides a classic
overlay ar- [...]
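To make the RON idea described above more concrete (overlay nodes cooperating to route packets around a failed direct path), the following Python sketch selects, for a source-destination pair, either the direct path or a detour through a single relay node, whichever has the highest estimated delivery probability. The loss table, the node names, the restriction to one relay, and the function name best_overlay_path are assumptions made for this illustration; they are not taken from [37].

```python
def best_overlay_path(src, dst, loss):
    """Return (path, delivery_probability) for the best direct or one-relay path.

    `loss` maps directed node pairs (a, b) to a probed loss rate in [0, 1].
    Pairs missing from the table are treated as unreachable (loss 1.0).
    This is an illustrative assumption, not the selection rule of any real system.
    """
    nodes = {node for pair in loss for node in pair}

    def delivery(a, b):
        return 1.0 - loss.get((a, b), 1.0)

    best_path, best_prob = [src, dst], delivery(src, dst)
    # Try every other overlay node as a single intermediate relay.
    for relay in sorted(nodes - {src, dst}):
        prob = delivery(src, relay) * delivery(relay, dst)
        if prob > best_prob:
            best_path, best_prob = [src, relay, dst], prob
    return best_path, best_prob


# Hypothetical probe results: the direct path A -> B is down.
probed_loss = {
    ("A", "B"): 1.0,
    ("A", "C"): 0.1,
    ("C", "B"): 0.2,
    ("A", "D"): 0.5,
    ("D", "B"): 0.5,
}
path, prob = best_overlay_path("A", "B", probed_loss)
print(path, round(prob, 2))  # prints ['A', 'C', 'B'] 0.72
```

In this toy input the direct path is unusable, so the sketch falls back to the relay with the best end-to-end delivery estimate. The need to probe every pair of overlay nodes also hints at why approaches of this kind scale poorly.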
At the same time, several studies (e.g., [176]) demonstrated how the use of CDNs can actually help
prevent or mitigate potential Internet outages caused by malicious behaviours. Indeed, CDNs
typically replicate contents across different geographically distributed servers. When a user
requests a content, the CDN typically redirects this request towards the best server according to
different policies. In this way, the CDN can dramatically mitigate the consequences of a DDoS
attack, since the high number of requests generated by the compromised machines would be load
balanced across different servers. This complementary service provided by CDNs is becoming more
and more critical in today’s networks.

8.2.4. Overall Resilient Strategies
At the highest level, we can define overall resilient strategies, that provide recommendations and
guidelines to obtain a more resilient network. [...] network resiliency. Resilinets is based on a
number of axioms, strategies, and principles. Basically, Resilinets considers a two-phase strategy
called D2R2 + DR, as shown in Fig. 9. The first active phase, D2R2, stands for “defend, detect,
remediate, recover”; it is the inner control loop and describes a set of activities that are
undertaken in order for a system to rapidly adapt to challenges and attacks and maintain an
acceptable level of service. The second active phase, DR, stands for “diagnose and refine”; it is
the outer loop that enables longer-term evolution of the system in order to enhance the approaches
to the activities of phase one.

Figure 9: The Resilinets strategy as shown and described in [224].

9. Open issues
In the previous sections, we have discussed the techniques and methodologies that have been
proposed to face the main challenges in dealing with Internet outages, i.e. (i) dissect specific
outage episodes; (ii) systematically detect them; (iii) quantify their impact on a network; (iv)
understand the resilience of the network; (v) assess the overall risk of the infrastructure to
this type of disruptive events; (vi) recover from them or prevent and mitigate their impact.

Based on our elaboration of the literature, we highlight here the open issues requiring additional
research in the short and long term. In Tab. 6 we map such open issues on the main challenging
problems that informed the structure of our literature research. It is evident that some of the
open issues span the whole spectrum of Internet outages challenges (namely, the lack of common
methodologies and metrics), while most regard one or few aspects alone. In the following we cover
each open issue in [...]

9.1. Common definitions and metrics
While a structured approach to failures of digital systems at large was presented some time ago
([217], currently at the 3rd edition, and [41]), much variability is still present in terminology,
specifically regarding failures in networking and Internet outages. The lack of widely accepted
definitions of important terms represents an issue that significantly slows down the research in
this field. For example, very often in literature resilience and fault-tolerance are considered
synonyms and only few studies attempt to distinguish between resilience and survivability. Since
metrics are derived from definitions, the lack of a widely accepted framework of formally defined
metrics for quantifying the resilience of an IP network to similar disruptive events is not
surprising. Similarly, we also noticed the lack of shared metrics when quantifying the impact of
Internet outages: very often only a qualitative evaluation of the impact is carried out. Some
other works defined their own metrics, thus
Challenging problems (columns): Analyze; Detect; Quantify impact; Quantify network robustness; Assess risk; Survive and mitigate
Open issues (rows):
Common methodologies and metrics ✗ ✗ ✗ ✗ ✗ ✗
Basic techniques ✗
Methodologies for outage analysis ✗
Validation of outage detection systems ✗
Availability of datasets ✗ ✗ ✗ ✗
Models for risk assessment ✗ ✗
Securing inter-domain routing ✗
Cloud resilience ✗ ✗
Wired-cum-wireless networks resilience ✗ ✗ ✗
Coordination ✗ ✗ ✗
Table 6: Mapping open research issues on challenging problems addressing Internet Outages.
preventing a systematic approach to the subject. Clearly, this strongly weakens our ability to
compare the several network disruptive events detected or analysed in literature. Finding a
convergence point is definitely an important open issue. Along this direction, we also believe
that an effort should be made to define subjective metrics when quantifying the impact of an
outage. Indeed, in the last years, Quality of Experience is attracting more and more interest
compared to Quality of Service. The latter focuses on objective parameters that do not necessarily
reflect the quality perceived by the end-user, which is essentially what a service provider really
cares about. Similarly, we believe that an effort should be made to evaluate the impact of an
outage in terms of how users perceive it. For example, whether or not the performance perceived by
end-users significantly degrades should be an important aspect to consider when evaluating the
severity of a network outage.

9.2. Basic techniques
In Sec. 2, we provided an overview of the basic techniques and data sources commonly used in the
outage-related literature. Unfortunately, these lim- [...] Current data sources (e.g., Traceroute
or BGP) cause the obtained topology model to be inaccurate and incomplete. Incomplete topologies
do not contain all the nodes or links of the actual network. For instance, BGP-derived AS-level
topologies are well known to be incomplete [108, 180, 113, 60]. Inaccurate topologies contain
incorrect information such as non-existing nodes or links (due to anonymous routers [168], hidden
routers [192], uneven and per-packet load balancing [39], third-party addresses [161], unresolved
IP aliasing [163], sampling biases [145], etc.), or incorrect node or link attributes such as the
locations of PoPs [93], or the capacity of links. Some studies proposed approaches to reduce the
inaccuracies of fault diagnosis algorithms in the presence of partial topology information. For
example, in [117] Holbert et al. proposed a strategy to infer the missing portions of a topology,
based on the use of UDP datagrams in case some routers in the network drop ICMP messages. Many
other countermeasures have been developed in the field of Internet topology discovery [246, 206],
but only some of them have been adopted in Internet outage-related works. Based on our studies, we
[...]
evaluation of incompleteness and inaccuracy of the topology models exploited in these works might
be very helpful to estimate a sort of “confidence interval” for impact and resilience metrics.

9.3. Methodologies for outage analysis
Several works focused on the analysis of specific episodes of large-scale Internet outages. These
analyses are of the utmost importance since investigating specific events increases our
understanding of the scope and the consequences of similar disruptive events as well as the
utility of the instruments for gaining insight on them. However, critically reviewing these works
(Sec. 3), we noticed the lack of a widely accepted methodology: each work adopted its own approach
built on top of a subset of basic techniques and data sources to derive insights and conclusions.
We believe that developing a widely accepted structured approach for this type of analysis is an
important future direction in this field. Nowadays, the validity and scope of the findings and
conclusions drawn starting from different methodologies are very hard to quantify. [...] in the
Internet, to what extent the achieved conclusions can be considered valid?” are only some of the
questions that the research community should address along this direction.

9.4. Validation of outage detection systems
The numerous outage detection systems developed during the last years continuously or
opportunistically monitor the network with focused or [...] and location also of the small
outages, a type of disruption representing a common threat to network operators. Unfortunately,
the general climate of collaboration and competition among the different networks in the Internet
strongly discourages network operators from sharing data or knowledge on the threats occurring in
their networks. Accordingly, it is very hard to validate the outcomes of these systems: not being
able to enumerate and investigate false alarms as well as undetected threats, researchers cannot
properly evaluate and improve their outage detection systems. This challenge affects the entire
research community involved in Internet measurements and represents a severe obstacle to the
advancement in the Internet outage-related field.

9.5. Availability of datasets
In analogy to the issue of Section 9.4, accessing detailed data about Internet outages is not an
easy task to accomplish. Indeed, focused datasets are rarely available. The Outages Archive [14]
and the Internet outage dataset [3] are the only two relevant examples currently available, to the
best of our knowledge. The former collects messages exchanged by operators and practitioners
through the Outages mailing list since 2006. The latter contains the results of measurement
campaigns aiming at investigating generic or specific Internet outages (outages clustering,
outages detection, address reachability, studies of hurricanes, etc.). While, in the latter, data
is structured and organized at different levels of abstraction, the former merely provides a
collection of e-mails, thus requiring addi- [...] are carried out leveraging data that can detect
outages indirectly, such as path-tracing data (usually obtained through traceroute) or BGP-related
information. Relevant examples are the datasets made available by CAIDA [6] or by the ANT Lab [2].
This kind of dataset proved a valuable source of information, but requires full understanding of
the mechanism, the procedures adopted for the analyses and the conditions in which they are
carried out, thus often needing proper assumptions to be leveraged. Due to the inherent
complexity, we refer [...]

Finally, we believe that relevant examples of existing outage monitoring and analysis systems are
also worth mentioning, although they do not publicly provide datasets at the time of writing.
CAIDA recently released a publicly-accessible operational prototype for IODA [10], a system aimed
at monitoring the Internet in near-realtime (leveraging information related to BGP, IBR, and
active probing), with the goal of identifying macroscopic Internet outages significantly impacting
an AS or a large fraction of a country.
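As a purely illustrative sketch of how several indirect data sources (such as the BGP, IBR, and active probing signals mentioned above) could be combined to flag a candidate outage, the Python fragment below marks a time bin when at least two signals drop well below their recent median. The windowed-median rule, the thresholds, the sample values, and the function names drop_flags and outage_bins are assumptions introduced for this example; they do not describe the actual algorithms of IODA or of any other cited system.

```python
from statistics import median


def drop_flags(series, window=6, ratio=0.5):
    """Flag bin i when series[i] falls below ratio * median of the previous `window` bins."""
    flags = []
    for i, value in enumerate(series):
        history = series[max(0, i - window):i]
        if len(history) < window:
            # Not enough history yet to estimate a baseline.
            flags.append(False)
            continue
        flags.append(value < ratio * median(history))
    return flags


def outage_bins(signals, min_agree=2, window=6, ratio=0.5):
    """Return the indices of bins where at least `min_agree` signals drop together."""
    per_signal = [drop_flags(s, window, ratio) for s in signals.values()]
    return [i for i, votes in enumerate(zip(*per_signal)) if sum(votes) >= min_agree]


# Hypothetical per-bin measurements for one AS (all values invented).
signals = {
    "bgp_visible_prefixes": [100] * 6 + [40, 100],
    "ibr_source_addresses": [50] * 6 + [10, 50],
    "probing_responsive_blocks": [20] * 6 + [19, 20],
}
print(outage_bins(signals))  # prints [6]
```

Requiring agreement among independent signals is one simple way to reduce false alarms, at the cost of missing events that are visible in only one data source; this trade-off is exactly what is hard to quantify without ground truth.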
9.6. Models for Risk Assessment
In Section 7, we discussed the notion of risk assessment applied to data-network infrastructures.
We argue that the main problem is modelling the outage. We reviewed a number of studies that
modelled an outage as a disk, which is an oversimplification if compared to the models commonly
exploited in civil engineering. The development of appropriate outage models appears to be a hot
open issue. Indeed, for instance, being able to perform a comprehensive risk assessment study on a
network infrastructure would enable companies to insure such infrastructures. To this aim,
predictive models would be of utmost importance, to evaluate the possible evolution of the network
infrastructure given the knowledge of its current status, and to assess the (improved)
consequences of a risk mitigation strategy. Unfortunately, most outage causes are particularly
hard to predict. Regarding natural disasters there is a long history of scientific research
addressing predictions of earthquakes, hurricanes, and similar catastrophic events [177], and it
is a hard scientific quest still open [190]. Regarding human-related causes, the prediction of
malicious activities is also particularly hard: some preliminary work has been done modeling the
psychology of an unfaithful employee (a so-called insider threat) [128], although it is
additionally hindered by the real-world applicability and privacy concerns related to the required
surveillance of workers. All these difficulties add to the inherently hard problem of modeling a
highly dynamic distributed system (the Internet) whose behavior emerges from the communications of
millions of human users and increasingly hard-to- [...]

An important future direction is securing inter-domain routing. Indeed, this solution may
effectively prevent Internet outages caused by malicious behaviors such as prefix hijacking
attacks or by accidental misconfigurations such as incorrect prepending. Although BGP is known to
be affected by these and other security vulnerabilities (see RFC 4272 [173] and Huston et al.
[123]), securing inter-domain routing still represents an open issue. Indeed, modifying the way
the whole Internet works proved to be an extremely complex process, a difficulty often referred to
as Internet ossification. Therefore, we highlight a quite obvious yet essential future direction:
finding applicable ways to make prefix advertisements secure. We argue that this could lead to a
dramatic decrease of outages causing logical disruption.

9.8. Cloud Resilience
Directly related to what we just observed in the previous paragraph, we can reach a less obvious
future direction. While modifying the way all routers in the Internet work is very hard, it is
relatively easy for data-center managers and cloud service providers to modify the way their
border routers work in order to achieve higher resilience against outages occurring in the public
Internet. To a certain extent, this is true for ISPs as well. Based on this, routing schemes or
other solutions aiming at improving the service resilience could be deployed. An example was
presented earlier with LIFEGUARD [134]. The unilateral deployment of failure avoidance techniques
would avoid the previously described limitations. We claim that a major effort should be put along
this direction, motivated by the strict SLAs that must be met by these service providers. To this
aim, promising technologies are emerging related to Software Defined Networks (SDN), enabling the
needed experimentation (and evolvability) on operating data center networks [76].

9.9. Wired-cum-Wireless Networks Resilience
Wireless networks and mobile services are becoming more and more a part of the global
communication infrastructure in the convergence of networks towards the Internet of Things.
Despite this, the resilience of wide area networks comprising wireless paths has not been studied
as extensively as their steadily growing importance would [...] bandwidth-) data communications
were cited but not estimated in the outage analysis: we refer to [220] for an early analysis of
outages and wireless network survivability in a pre-convergence scenario. Other works have
addressed the performance of wireless metropolitan area networks [130, 131], or evaluated (in
simulation) the interdependence between electrical power distribution networks and mobile networks
in case of faults [120]. Finally, due to their ease of deployment, wireless networks have been
considered as means for backup and mitigation in case of disasters [61, 100, 203].

Compared with the scientific corpus available for wired wide-area-networks, the paucity of research
on outage analysis, mitigation and prevention for networks including wireless paths is evident. In
our opinion this can be a symptom that several already mentioned open issues (especially the lack
of common definitions and metrics and the lack of models for risk assessment) have so far limited
the interest and research on this aspect. Nevertheless we expect more attention to be drawn to
this specific topic in light of the spread of Industry 4.0 scenarios [239].

9.10. Coordination
Small outages located inside a given network can easily be detected, located, and fixed by the
corresponding operator. On the other hand, large-scale outages involving final users, content
providers, and multiple transit networks definitely require the coordinated action of different
entities forced to share knowledge and data to find and solve the issue. The Outages mailing list
[14] represents a common way for network operators to advertise or enquire about Internet outages.
Compared to the highly sophisticated technology employed in their infrastructure, this
coordination tool appears somehow anachronistic and largely ineffective. Hence, we no- [...]

this topic, providing an extensive and carefully organized picture of the literature related to
Internet outages. Moreover, to the best of our knowledge, we provided several innovative
contributions achieved through this study: (i) a road to systematically study Internet outages;
(ii) a characterization of the causes of these disruptive events; (iii) a classification of the
basic techniques used by researchers; (iv) a systematic analysis of works dissecting specific
Internet outage episodes, underlining common practices, weaknesses, and differences of the
proposed approaches; (v) a general approach for outage detection; (vi) a classification of
approaches for outage impact evaluation; (vii) an apportionment of definitions and metrics for
evaluating the resilience of a communication network; (viii) a systematic discussion on the
assessment of risk of networks to outages; (ix) an overview of the solutions proposed to prevent,
mitigate, or resolve Internet outages and their consequences; (x) a detailed analysis of open
issues and future directions in this research field.

The paper constitutes an important starting point for researchers willing to simply understand or
to contribute to this wide and articulate research [...]
Figure 3: Classification of Internet outages: taxonomies from literature and this paper.
[Figure 3 sources: Kuhn [141]*; Markopoulou et al. [166]; Wu et al. [247]; this paper]
* Considers Public Switched Telephone Network, not the Internet.
Figure 4: A characterization of the causes of outages based on the origin, the intentionality, and the type of disruption.
Figure 6: Flowchart of a generic outage detection tool.
[Figure 6 stages: data collection & pre-processing; network outage detection]
Figure 8: A classification of the specific solutions to prevent or mitigate the consequences of network outages.
Giuseppe Aceto is a Post Doc at the Department of Electrical Engineering and Information
Technology of University of Napoli Federico II. Giuseppe has a PhD in telecommunication engineering from the University
of Napoli Federico II. His work falls in measurement and monitoring of network performance and security, with focus
on censorship. He has served and serves as reviewer for several journals and conferences (e.g. IEEE Transactions on
Cloud Computing, Elsevier’s JNCA, Computer Communications, Globecom, ICC). Giuseppe Aceto is co-author of papers
in international publishing venues (IEEE Transactions on Network and Service Management, Elsevier Journal of Network
and Computer Applications, Computer Networks, INFOCOM, ACM SAC, etc.) and is co-author of a patent. Giuseppe is the
recipient of a best paper award at IEEE ISCC 2010.
Alessio Botta received the M.S. degree in telecommunications engineering and the Ph.D. degree in computer engineering
and systems from the University of Naples Federico II, Naples, Italy. He currently holds a post-doctoral position with
the Department of Computer Engineering and Systems, University of Naples Federico II. He has co-authored over 50
international journal and conference publications. His current research interests include networking, and, in particular,
network performance measurement and improvement, with a focus on wireless and heterogeneous systems. Dr. Botta
has served and serves as an independent reviewer of research and implementation project proposals for the Romanian
government. He was a recipient of the Best Local Paper Award at the IEEE ISCC 2010. In the research area of networking,
he has chaired international conferences and workshops, served and serves on several technical program committees of
international conferences (IEEE Globecom and IEEE ICC), and acted as a reviewer for different international conferences
(the IEEE Conference on Computer Communications) and journals (the IEEE Transactions on Mobile Computing, the IEEE
Network Magazine, and the IEEE Transactions on Vehicular Technology).
Pietro Marchetta received his Master's degree and PhD degree in Computer Engineering at the University of Napoli in 2014.
Currently, he holds a postdoctoral position at the Department of Electrical Engineering and Information Technology of the
University of Napoli Federico II (Italy). His main research activities focus on methodologies, techniques and large-scale
distributed platforms for Internet measurements with a specific focus on Internet topology, routing, and performance. He
has served and serves as a reviewer for a dozen conferences and journals (e.g., Elsevier’s Computer Networks and Future
Generation Computer Systems). For his research, he has received several awards, including first place at the ACM Student
Research Competition at SIGCOMM 2012 and the Best Student Workshop Paper Award in CoNEXT 2013. Pietro Marchetta
has also been IT lead for the Smart City project “S2move – Smart and Social Move” financed by MIUR.
Valerio Persico is a Post Doc at the Department of Electrical Engineering and Information Technology of the University of
Napoli Federico II. He has a PhD in computer engineering from the University of Napoli Federico II. His work focuses
on measurement and monitoring of cloud network infrastructures. Recently, he has been working on bioinformatics. Valerio is the
recipient of the best student paper award at ACM CoNEXT 2013.
Antonio Pescapé [SM ’09] is a Full Professor at the Department of Electrical Engineering and Information Technology
of the University of Napoli Federico II (Italy). His research interests are in the networking field with focus on Internet
Monitoring, Measurements and Management and on Network Security. Antonio Pescapé has coauthored over 180 journal
and conference publications and is co-author of a patent. For his research activities he has received several awards,
including a Google Faculty Award, several best paper awards, and two IRTF (Internet Research Task Force) ANRP (Applied
Networking Research Prize) awards.