Detecting Internet Outages with Precise Active Probing (extended)

Parts of the Internet are down every day, from the intentional shutdown of the Egyptian Internet in Jan. 2011 and natural disasters such as the Mar. 2011 Japanese earthquake, to the thousands of small outages caused by localized accidents, human error, maintenance, or choices. Understanding these events requires efficient and accurate detection methods, motivating our new system to detect network outages by active probing. We show that a single computer can track outages across the entire analyzable IPv4 Internet, probing a sample of 20 addresses in each of the 2.5M responsive /24 address blocks. We show that our approach is significantly more accurate than the best current methods, with 31% fewer false conclusions, while providing 14% greater coverage and requiring about the same probing traffic. We develop new algorithms to identify outages and cluster them into events, providing the first visualization of outages. We carefully validate our approach, showing consistent results over two years and from three different sites. Using public BGP archives and news sources, we confirm 83% of large events. For a random sample of 50 observed events, we find 38% in partial control-plane information, reaffirming prior work that small outages are often not caused by BGP. Through controlled emulation, we show that our approach detects 100% of full-block outages that last at least twice our probing interval. Finally, we report on Internet stability as a whole, and on the size and duration of typical outages, using core-to-edge observations with much larger coverage than prior mesh-based studies. We find that about 0.3% of the Internet is likely to be unreachable at any time, suggesting the Internet provides only 2.5 “nines” of availability.
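
The step from "0.3% unreachable" to "2.5 nines" can be checked with a short calculation. The sketch below assumes the common convention that the number of nines is the negative base-10 logarithm of the unavailable fraction; the abstract does not state the exact definition used, and the function name `nines` is illustrative only.

```python
import math


def nines(unavailable_fraction: float) -> float:
    """Express availability as a fractional count of nines.

    Assumed convention (not taken from the abstract):
    nines = -log10(unavailable fraction), so 0.1% down gives 3 nines.
    """
    if not 0.0 < unavailable_fraction < 1.0:
        raise ValueError("unavailable_fraction must be strictly between 0 and 1")
    return -math.log10(unavailable_fraction)


# 0.3% unreachable is the figure quoted in the abstract.
unreachable = 0.003
availability = 1.0 - unreachable
print(f"availability = {availability:.3f}, nines = {nines(unreachable):.2f}")
# prints "availability = 0.997, nines = 2.52"
```

Under this convention, 0.3% unreachability corresponds to 99.7% availability, which lies between two nines (99%) and three nines (99.9%) and rounds to the 2.5 quoted above.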
