Learning, indexing, and diagnosing network faults

Modern communication networks generate massive volume of operational event data, e.g., alarm, alert, and metrics, which can be used by a network management system (NMS) to diagnose potential faults. In this work, we introduce a new class of indexable fault signatures that encode temporal evolution of events generated by a network fault as well as topological relationships among the nodes where these events occur. We present an efficient learning algorithm to extract such fault signatures from noisy historical event data, and with the help of novel space-time indexing structures, we show how to perform efficient, online signature matching. We provide results from extensive experimental studies to explore the efficacy of our approach and point out potential applications of such signatures for many different types of networks including social and information networks.

[1]  Malgorzata Steinder,et al.  A survey of fault localization techniques in computer networks , 2004, Sci. Comput. Program..

[2]  John Moy,et al.  OSPF Version 2 , 1998, RFC.

[3]  Lundy M. Lewis,et al.  A Case-Based Reasoning Approach to the Resolution of Faults in Communication Networks , 1993, Integrated Network Management.

[4]  Haifeng Chen,et al.  Automatic Profiling of Network Event Sequences: Algorithm and Applications , 2008, IEEE INFOCOM 2008 - The 27th Conference on Computer Communications.

[5]  Yossi A. Nygate,et al.  Event correlation using rule and object based techniques , 1995, Integrated Network Management.

[6]  Edith Cohen,et al.  Reachability and distance queries via 2-hop labels , 2002, SODA '02.

[7]  Peng Wu,et al.  Alarm correlation engine (ACE) , 1998, NOMS 98 1998 IEEE Network Operations and Management Symposium.

[8]  Armando Fox,et al.  Capturing, indexing, clustering, and retrieving system history , 2005, SOSP '05.

[9]  Li Fan,et al.  Summary cache: a scalable wide-area web cache sharing protocol , 2000, TNET.

[10]  Joan Feigenbaum,et al.  Learning-based anomaly detection in BGP updates , 2005, MineNet '05.

[11]  Anja Feldmann,et al.  Locating internet routing instabilities , 2004, SIGCOMM 2004.

[12]  Mischa Schwartz,et al.  Schemes for fault identification in communication networks , 1995, TNET.

[13]  Mark Crovella,et al.  Mining anomalies using traffic feature distributions , 2005, SIGCOMM '05.

[14]  Nick Feamster,et al.  Diagnosing network disruptions with network-wide analysis , 2007, SIGMETRICS '07.

[15]  H. Akaike A new look at the statistical model identification , 1974 .

[16]  Alexander Borgida,et al.  Efficient management of transitive relationships in large data and knowledge bases , 1989, SIGMOD '89.

[17]  Martin H. Levinson Linked: The New Science of Networks , 2004 .

[18]  Felix Salfner,et al.  Event-based Failure Prediction: An Extended Hidden Markov Model Approach , 2008, Ausgezeichnete Informatikdissertationen.

[19]  Judea Pearl,et al.  Probabilistic reasoning in intelligent systems - networks of plausible inference , 1991, Morgan Kaufmann series in representation and reasoning.

[20]  D. Ohsie,et al.  High speed and robust event correlation , 1996, IEEE Commun. Mag..

[21]  D. Rubin,et al.  Maximum likelihood from incomplete data via the EM - algorithm plus discussions on the paper , 1977 .

[22]  Suman Nath,et al.  COLR-Tree: Communication-Efficient Spatio-Temporal Indexing for a Sensor Data Web Portal , 2008, 2008 IEEE 24th International Conference on Data Engineering.

[23]  Tian Zhang,et al.  BIRCH: an efficient data clustering method for very large databases , 1996, SIGMOD '96.

[24]  Wei-Ying Ma,et al.  Automated known problem diagnosis with event traces , 2006, EuroSys.

[25]  Marios Hadjieleftheriou,et al.  R-Trees - A Dynamic Index Structure for Spatial Searching , 2008, ACM SIGSPATIAL International Workshop on Advances in Geographic Information Systems.