Troubleshooting chronic conditions in large IP networks

Chronic network conditions are caused by performance-impairing events that occur intermittently over an extended period of time. Such conditions can repeatedly degrade performance for customers and can sometimes turn into serious hard failures. It is therefore critical to troubleshoot and repair chronic network conditions promptly to ensure high reliability and performance in large IP networks. Today, troubleshooting chronic conditions is often performed manually, making it a tedious, time-consuming, and error-prone process. In this paper, we present NICE (Network-wide Information Correlation and Exploration), a novel infrastructure that enables the troubleshooting of chronic network conditions by detecting and analyzing statistical correlations across multiple data sources. NICE uses a novel circular permutation test to determine the statistical significance of a correlation. It also allows flexible analysis at various spatial granularities (e.g., link, router, or network level). We validate NICE using real measurement data collected at a tier-1 ISP network, and the results are positive. We then apply NICE to troubleshoot real network issues in the tier-1 ISP network. In all three case studies conducted so far, NICE successfully uncovers previously unknown chronic network conditions, resulting in improved network operations.
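
To make the idea of a circular permutation test concrete, the following is a minimal sketch in Python. It is an illustration under stated assumptions, not NICE's actual procedure: the function names (`pearson`, `circular_permutation_test`), the one-sided empirical p-value, and the choice of Pearson correlation as the statistic are all assumptions made here. The general idea is that circularly shifting one series against the other preserves each series' own temporal structure (such as autocorrelation) while breaking their alignment, so the correlations at non-zero shifts form a reference distribution for the correlation observed at zero shift.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def circular_permutation_test(x, y):
    """Hypothetical helper: returns (observed correlation, empirical p-value).

    The reference distribution is built from every non-zero circular shift
    of y against x. The p-value is one-sided (shifted r >= observed r) and
    uses the common +1 correction; both are assumptions of this sketch.
    """
    if len(x) != len(y):
        raise ValueError("series must have the same length")
    if len(x) < 3:
        raise ValueError("series too short for a permutation test")
    x = list(x)
    y = list(y)
    n = len(x)
    observed = pearson(x, y)
    # Correlation at each circular shift s = 1 .. n-1 of y.
    shifted = [pearson(x, y[s:] + y[:s]) for s in range(1, n)]
    exceed = sum(1 for r in shifted if r >= observed)
    p_value = (exceed + 1) / (len(shifted) + 1)
    return observed, p_value


# Illustrative data only: two binary event series over 50 time bins
# (e.g., "link flap seen in bin i" and "packet loss seen in bin i").
events_a = [0] * 50
for i in (3, 17, 31, 44):
    events_a[i] = 1
events_b = list(events_a)  # perfectly aligned with events_a

r, p = circular_permutation_test(events_a, events_b)
print(r, p)
```

A small p-value suggests that the alignment between the two series is unlikely to arise from their individual temporal patterns alone. With short series the smallest attainable p-value is limited by the number of available shifts.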
