A Social Network Under Social Distancing: Risk-Driven Backbone Management During COVID-19 and Beyond

As the COVID-19 pandemic reshapes our social landscape, its lessons have far-reaching implications for how online service providers manage their infrastructure to mitigate risks. This paper presents Facebook's risk-driven backbone management strategy to ensure high service performance throughout the COVID-19 pandemic. We describe the Risk Simulation System (RSS), a production system that identifies possible failures and quantifies their potential severity with a set of metrics for network risk. With a year-long risk measurement from RSS, we show that our backbone resiliently withstood the COVID-19 stress test, achieving high service availability and low route dilation while efficiently handling traffic surges. We also share our operational practices to mitigate risk throughout the pandemic. Our findings give insights to further improve risk-driven network management. We argue for incorporating short-term failure statistics in modeling failures. Common failure prediction models based on long-term modeling achieve stable output at the cost of assigning low significance to unique short-term events of extreme importance, such as COVID-19. Furthermore, we advocate augmenting network management techniques with non-networking signals. We support this by identifying and analyzing the correlation between network traffic and human mobility.

© 2021 by The USENIX Association
