Advancing the Art of Internet Edge Outage Detection

Measuring the reliability of edge networks in the Internet is difficult due to the size and heterogeneity of networks, the rarity of outages, and the difficulty of finding vantage points that can accurately capture such events at scale. In this paper, we use logs from a major CDN that detail hourly request counts from address blocks. We discovered that in many edge address blocks, devices collectively contact the CDN every hour over weeks and months. We establish that a sudden, temporary absence of these requests indicates a loss of Internet connectivity for those address blocks; we call these events disruptions. We develop a disruption detection technique and present broad and detailed statistics on 1.5M disruption events over the course of a year. Our approach reveals that disruptions do not necessarily reflect actual service outages, but can be the result of prefix migrations. Major natural disasters are clearly represented in our data, as expected; however, a large share of detected disruptions correlate well with planned human intervention during scheduled maintenance intervals, and are thus unlikely to be caused by external factors. Cross-evaluating our results, we find that current state-of-the-art active outage detection over-estimates the occurrence of disruptions in some address blocks. Our observations of disruptions, service outages, and the different causes of such events yield implications for the design of outage detection systems, as well as for policymakers seeking to establish reporting requirements for Internet services.
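To make the core idea concrete, the following is a minimal sketch of how a sudden, temporary absence of hourly requests could be flagged for a single address block. It is an illustration only, not the paper's detection technique: the function name `find_disruptions`, the one-week baseline window, and the 48-hour cap on gap length are all assumptions chosen for the example.

```python
from typing import List, Tuple


def find_disruptions(
    hourly_counts: List[int],
    baseline_hours: int = 168,   # assumed: one week of steady activity required
    max_gap_hours: int = 48,     # assumed: longer gaps are not "temporary"
) -> List[Tuple[int, int]]:
    """Return (start, end) hour indices (end exclusive) of candidate disruptions.

    A candidate is a run of zero-request hours that
      * follows `baseline_hours` consecutive hours with at least one request,
      * lasts at most `max_gap_hours`, and
      * is followed by at least one hour in which requests resume.
    """
    events: List[Tuple[int, int]] = []
    n = len(hourly_counts)
    i = 0
    while i < n:
        if hourly_counts[i] > 0:
            i += 1
            continue

        # Extend the run of silent hours starting at index i.
        j = i
        while j < n and hourly_counts[j] == 0:
            j += 1

        # The block must have been active in every hour of the baseline window.
        baseline = hourly_counts[max(0, i - baseline_hours):i]
        steady = len(baseline) == baseline_hours and all(c > 0 for c in baseline)

        # Requests must come back before the end of the series.
        recovered = j < n

        if steady and recovered and (j - i) <= max_gap_hours:
            events.append((i, j))
        i = j
    return events


if __name__ == "__main__":
    # One week of steady activity, a three-hour silence, then recovery.
    counts = [5] * 168 + [0] * 3 + [4] * 10
    print(find_disruptions(counts))
```

Requiring a fully active baseline before each gap mirrors the observation that many edge address blocks contact the CDN every hour; blocks without such a steady pattern simply produce no candidates in this sketch. Note that a flagged gap is only a candidate: as the abstract points out, a disruption can also stem from a prefix migration rather than a service outage.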
