Passive Realtime Datacenter Fault Detection and Localization

Datacenters are characterized by their large scale, stringent reliability requirements, and significant application diversity. However, the realities of employing hardware with small but non-zero failure rates mean that datacenters are subject to significant numbers of failures, impacting the performance of the services that rely on them. To make matters worse, these failures are not always obvious; network switches and links can fail partially, dropping or delaying various subsets of packets without necessarily delivering a clear signal that they are faulty. Thus, traditional fault detection techniques involving end-host or router-based statistics can fall short in their ability to identify these errors. We describe how to expedite the process of detecting and localizing partial datacenter faults using an end-host method generalizable to most datacenter applications. In particular, we correlate transport-layer flow metrics and network-I/O system call delay at end hosts with the path that traffic takes through the datacenter and apply statistical analysis techniques to identify outliers and localize the faulty link and/or switch(es). We evaluate our approach in a production Facebook front-end datacenter.
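The correlation step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the choice of TCP retransmission rate as the flow metric, the median-absolute-deviation outlier test, the threshold of 3.5, and all function names, link names, and sample numbers are assumptions made for illustration only.

```python
from collections import defaultdict
from statistics import median


def link_retransmit_rates(flows):
    """Attribute each flow's packets and retransmits to every link on its path.

    flows: iterable of (path, packets_sent, retransmits), where path is a
    sequence of link identifiers (hypothetical format).
    """
    sent = defaultdict(int)
    retx = defaultdict(int)
    for path, packets, retransmits in flows:
        for link in path:
            sent[link] += packets
            retx[link] += retransmits
    return {link: retx[link] / sent[link] for link in sent if sent[link] > 0}


def outlier_links(rates, threshold=3.5):
    """Return links whose rate is an upper outlier by a modified z-score.

    Uses the median absolute deviation (MAD); 0.6745 scales the MAD so the
    score is comparable to a standard deviation under normality.
    """
    values = list(rates.values())
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        return []  # no spread among links, nothing to compare against
    return sorted(
        link for link, v in rates.items()
        if 0.6745 * (v - med) / mad > threshold
    )


# Hypothetical sample: every flow crosses one "s" link and one "c" link;
# flows through s3 see far more retransmits than the rest.
example_flows = [
    (("s1", "c1"), 1000, 1), (("s1", "c2"), 1000, 2),
    (("s2", "c1"), 1000, 1), (("s2", "c2"), 1000, 3),
    (("s3", "c1"), 1000, 40), (("s3", "c2"), 1000, 42),
    (("s4", "c1"), 1000, 2), (("s4", "c2"), 1000, 1),
]

if __name__ == "__main__":
    print(outlier_links(link_retransmit_rates(example_flows)))
```

In this sample the shared links c1 and c2 also carry the affected flows, but their aggregate rates are diluted by healthy traffic, so only s3 stands out. A real deployment would need to account for link tiers, traffic volume, and sampling noise, none of which this sketch models.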
