Disco: Fast, good, and cheap outage detection

Outage detection has been studied from different angles, such as active probing, analysis of background radiation, or control plane information. We approach outage detection from a new perspective. Disco is a detection technique that uses existing long-running TCP connections to identify bursts of disconnections. The benefits are considerable: without adding a single packet to the traffic, we can monitor Internet-wide swaths of infrastructure that were not monitored previously because they are, for example, unresponsive to ICMP probes or behind NATs. With Disco we analyze state changes on connections between RIPE Atlas probes and the RIPE Atlas infrastructure. These data, which are originally logged to monitor probe availability, have a small footprint and are available as a publicly accessible live stream, which makes lightweight, near real-time outage detection possible. Probes perform planned traceroute measurements regardless of their connectivity to the RIPE Atlas infrastructure. This gives us, at no extra cost, a view of the outage from the inside out, as the probes experienced it, which lets us characterize the outage after the fact. Thus, we present an outage detection system that runs in near real-time (fast), with a precision of 95% (good), and without generating any new measurement traffic (cheap). We studied historical probe disconnections from 2011 to 2016 and report on the 443 most prominent outages. To validate our results we inspected traceroute results from affected probes and compared our detection to that of Trinocular.
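To make the idea of a "burst of disconnections" concrete, the following is a minimal sketch that flags time intervals in which several probes disconnect close together. It is an illustration only and is not Disco's detection algorithm: it swaps in a simple fixed-threshold sliding window, and the function name `find_bursts` and the parameters `window` and `threshold` are hypothetical choices made for this example. Input timestamps are assumed to be disconnection times (in seconds) for probes that share some grouping, such as a network or region.

```python
from collections import deque
from typing import Iterable, List, Tuple


def find_bursts(
    disconnect_times: Iterable[float],
    window: float = 60.0,   # hypothetical: look-back span in seconds
    threshold: int = 3,     # hypothetical: disconnections needed to call a burst
) -> List[Tuple[float, float]]:
    """Return (start, end) intervals in which at least `threshold`
    disconnections occurred within `window` seconds of each other."""
    recent: deque = deque()               # disconnections inside the current window
    bursts: List[Tuple[float, float]] = []

    for t in sorted(disconnect_times):
        recent.append(t)
        # Drop events that are older than `window` seconds relative to t.
        while t - recent[0] > window:
            recent.popleft()

        if len(recent) >= threshold:
            start = recent[0]
            if bursts and start <= bursts[-1][1]:
                # Overlaps the previous burst: extend it.
                bursts[-1] = (bursts[-1][0], t)
            else:
                bursts.append((start, t))

    return bursts


if __name__ == "__main__":
    # Isolated disconnections at 0 s and 500 s, then four within 30 s.
    events = [0, 500, 1000, 1010, 1020, 1030, 5000]
    print(find_bursts(events))
```

A fixed threshold is the simplest possible choice and will misfire when the background disconnection rate varies between groups of probes; a production detector would need a model of that baseline rate.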
