Taking the Blame Game out of Data Centers Operations with NetPoirot

Today, root cause analysis of failures in data centers is mostly done through manual inspection. More often than not, customers blame the network as the culprit. However, other components of the system might have caused these failures. To troubleshoot, huge volumes of data are collected over the entire data center. Correlating such large volumes of diverse data collected from different vantage points is a daunting task even for the most skilled technicians. In this paper, we revisit the question: how much can you infer about a failure in the data center using TCP statistics collected at one of the endpoints? Using an agent that captures TCP statistics, we devised a classification algorithm that identifies the root cause of failure using this information at a single endpoint. Using insights derived from this classification algorithm, we identify dominant TCP metrics that indicate where and why problems occur in the network. We validate and test these methods using data that we collect over a period of six months in a production data center.
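To make the idea of classifying failures from single-endpoint TCP statistics concrete, the following is a minimal sketch, not the algorithm used in the paper. It swaps in a plain nearest-centroid classifier over standardized features. The feature names (retransmissions, mean RTT, zero-window events), the class labels, and every sample value are assumptions invented for illustration; none of them come from the paper's data.

```python
from collections import defaultdict
from statistics import mean, pstdev

# Hypothetical feature order: (retransmissions, mean RTT in ms, zero-window events).
# All samples are synthetic and exist only to illustrate the workflow.
TRAINING = [
    ((40, 120.0, 0), "network"),
    ((35, 150.0, 1), "network"),
    ((2, 20.0, 30), "server"),
    ((3, 25.0, 25), "server"),
    ((1, 15.0, 0), "healthy"),
    ((0, 18.0, 1), "healthy"),
]


def scale(features, means, stds):
    """Standardize one feature vector with the given per-feature parameters."""
    return [(x - m) / s for x, m, s in zip(features, means, stds)]


def fit(samples):
    """Return per-feature scaling parameters and one centroid per label."""
    columns = list(zip(*(features for features, _ in samples)))
    means = [mean(col) for col in columns]
    # Fall back to 1.0 when a feature has zero spread, to avoid dividing by zero.
    stds = [pstdev(col) or 1.0 for col in columns]

    grouped = defaultdict(list)
    for features, label in samples:
        grouped[label].append(scale(features, means, stds))

    centroids = {
        label: [mean(col) for col in zip(*rows)]
        for label, rows in grouped.items()
    }
    return means, stds, centroids


def classify(features, model):
    """Return the label whose centroid is closest to the scaled feature vector."""
    means, stds, centroids = model
    point = scale(features, means, stds)

    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(point, centroid))

    return min(centroids, key=lambda label: squared_distance(centroids[label]))


model = fit(TRAINING)
print(classify((38, 130.0, 0), model))
```

A nearest-centroid model is used here only because it fits in a few lines of standard-library Python; any supervised classifier could take its place in the same fit-then-classify structure.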
