Syslog processing for switch failure diagnosis and prediction in datacenter networks

Syslogs on switches are a rich source of information for both post-mortem diagnosis and proactive prediction of switch failures in a datacenter network. However, such information can be effectively extracted only through proper processing of syslogs, e.g., using suitable machine learning techniques. A common approach to syslog processing is to extract (i.e., build) templates from historical syslog messages and then match syslog messages to these templates. However, existing template extraction techniques either have low accuracies in learning the “correct” set of templates, or does not support incremental learning in the sense the entire set of templates has to be rebuilt (from processing all historical syslog messages again) when a new template is to be added, which is prohibitively expensive computationally if used for a large datacenter network. To address these two problems, we propose a frequent template tree (FT-tree) model in which frequent combinations of (syslog) words are identified and then used as message templates. FT-tree empirically extracts message templates more accurately than existing approaches, and naturally supports incremental learning. To compare the performance of FT-tree and three other template learning techniques, we experimented them on two-years' worth of failure tickets and syslogs collected from switches deployed across 10+ datacenters of a tier-1 cloud service provider. The experiments demonstrated that FT-tree improved the estimation/prediction accuracy (as measured by F1) by 155% to 188%, and the computational efficiency by 117 to 730 times.

[1]  Yin Zhang,et al.  Rapid detection of maintenance induced changes in service performance , 2011, CoNEXT '11.

[2]  Ron Kohavi,et al.  A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection , 1995, IJCAI.

[3]  Miroslaw Malek,et al.  A survey of online failure prediction methods , 2010, CSUR.

[4]  Jian Pei,et al.  Mining frequent patterns without candidate generation , 2000, SIGMOD '00.

[5]  Dan Pei,et al.  What happened in my network: mining network events from router syslogs , 2010, IMC '10.

[6]  Walter Willinger,et al.  Spatio-temporal compressive sensing and internet traffic matrices , 2009, SIGCOMM '09.

[7]  Hua Chen,et al.  Pingmesh: A Large-Scale System for Data Center Network Latency Measurement and Analysis , 2015, SIGCOMM.

[8]  Geoffrey J. McLachlan,et al.  Analyzing Microarray Gene Expression Data , 2004 .

[9]  Akio Watanabe,et al.  Spatio-temporal factorization of log data for understanding network events , 2014, IEEE INFOCOM 2014 - IEEE Conference on Computer Communications.

[10]  Yin Zhang,et al.  Detecting the performance impact of upgrades in large operational networks , 2010, SIGCOMM '10.

[11]  Akio Watanabe,et al.  Proactive failure detection learning generation patterns of large-scale network logs , 2015, 2015 11th International Conference on Network and Service Management (CNSM).

[12]  Anand Sivasubramaniam,et al.  BlueGene/L Failure Analysis and Prediction Models , 2006, International Conference on Dependable Systems and Networks (DSN'06).

[13]  Felix Salfner,et al.  Error Log Processing for Accurate Failure Prediction , 2008, WASL.

[14]  Navendu Jain,et al.  Understanding network failures in data centers: measurement, analysis, and implications , 2011, SIGCOMM.

[15]  Zhiling Lan,et al.  System log pre-processing to improve failure prediction , 2009, 2009 IEEE/IFIP International Conference on Dependable Systems & Networks.

[16]  William M. Rand,et al.  Objective Criteria for the Evaluation of Clustering Methods , 1971 .

[17]  Glenn A. Fink,et al.  Predicting Computer System Failures Using Support Vector Machines , 2008, WASL.

[18]  Nick McKeown,et al.  I Know What Your Packet Did Last Hop: Using Packet Histories to Troubleshoot Networks , 2014, NSDI.

[19]  Miroslaw Malek,et al.  Using Hidden Semi-Markov Models for Effective Online Failure Prediction , 2007, 2007 26th IEEE International Symposium on Reliable Distributed Systems (SRDS 2007).

[20]  Kenji Yamanishi,et al.  Dynamic syslog mining for network failure monitoring , 2005, KDD '05.

[21]  Mohamed Hefeeda,et al.  Real-time failure prediction in online services , 2015, 2015 IEEE Conference on Computer Communications (INFOCOM).

[22]  Xin Wu,et al.  NetPilot: automating datacenter network failure mitigation , 2012, SIGCOMM '12.

[23]  Hiroshi Sawada,et al.  Change-Point Detection with Feature Selection in High-Dimensional Time-Series Data , 2013, IJCAI.

[24]  Andreas Haeberlen,et al.  Diagnosing missing events in distributed systems with negative provenance , 2014, SIGCOMM.

[25]  Alberto Sillitti,et al.  Failure prediction based on log files using Random Indexing and Support Vector Machines , 2013, J. Syst. Softw..