Network-wide Information Correlation and Exploration (NICE): Framework, Applications, and Experience

Scalable event detection and troubleshooting capabilities are critical for ensuring high levels of network reliability and performance. Although network operations systems are typically well designed for dealing with hard network outages (e.g., link failures), detecting and analyzing chronic conditions, particularly those associated with short-term performance impairments, remains challenging. Detecting and troubleshooting such conditions typically requires detailed analysis of data collected from different monitoring tools to obtain a comprehensive view of network events. This analysis is typically performed manually, making it an imperfect, time-consuming, and costly process. The ability to perform correlations is a fundamental yet powerful building block for analyzing multiple data series collectively. We present a novel framework, NICE (Network-wide Information Correlation and Exploration), that scalably analyzes network-wide statistical event correlations. The core components of NICE include a flexible infrastructure for pairwise correlation testing, as well as tools for subsequent analysis of the resulting correlation patterns and automatic drill-down into surprising correlations. On top of the core NICE infrastructure, we have prototyped two applications: (i) troubleshooting known problems, and (ii) discovering undesirable modes of network operation that may traditionally have gone unnoticed by the operations team, yet may be impacting customers. We evaluate the accuracy of NICE by examining several data streams from a tier-1 ISP backbone network. We also present case studies that demonstrate the efficacy of our toolkit by revealing surprising correlations in the same tier-1 network. The NICE methodology and algorithms promise to be of immense use to network operators in analyzing network behavior and identifying anomalous network conditions.
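
To make the idea of pairwise correlation testing concrete, the following is a minimal sketch, not the NICE implementation. It assumes two event series have already been binned into equal-length 0/1 sequences (1 meaning the event occurred in that time bin), computes their Pearson correlation coefficient, and scores it with Fisher's z-transformation. The function names, the 1.96 threshold, and the choice of significance test are illustrative assumptions. The Fisher test also assumes independent samples, which autocorrelated network event series may violate.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric series."""
    n = len(x)
    if n != len(y):
        raise ValueError("series must have the same length")
    if n < 4:
        raise ValueError("need at least 4 samples for the Fisher z score")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant series")
    return sxy / math.sqrt(sxx * syy)


def fisher_z_score(r, n):
    """Approximate standard-normal score of r under the null of no correlation."""
    # Clamp r so that atanh stays finite for perfectly correlated series.
    r = max(min(r, 0.999999), -0.999999)
    return math.atanh(r) * math.sqrt(n - 3)


def correlation_test(x, y, z_threshold=1.96):
    """Return (r, z, significant) for one pair of series.

    z_threshold=1.96 corresponds to a two-sided 5% level; this is an
    assumed value chosen for illustration only.
    """
    r = pearson(x, y)
    z = fisher_z_score(r, len(x))
    return r, z, abs(z) > z_threshold


if __name__ == "__main__":
    # Hypothetical binned event series (not real measurement data).
    link_flaps = [0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0]
    packet_loss = [0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0]
    print(correlation_test(link_flaps, packet_loss))
```

In a network-wide setting, a test of this kind would be run over many pairs of series, so the threshold would normally need adjusting for multiple comparisons.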
