Efficient and Robust Syslog Parsing for Network Devices in Datacenter Networks

Syslog parsing is vital for detecting, diagnosing, and predicting network device failures in a datacenter. A common approach is to extract templates from historical syslogs and then match incoming syslogs to these templates. To address the problems in existing syslog parsing techniques, we propose a novel framework, Craftsman, which identifies frequent combinations of (syslog) words and applies them as templates. Empirically, Craftsman extracts templates accurately, is extremely efficient in template matching, and naturally supports incremental learning. To compare Craftsman with three other template learning techniques designed for network devices, we evaluate all four on two years’ worth of syslogs collected from network devices deployed across 10+ datacenters of a tier-one service provider. The experiments demonstrate that Craftsman achieves an accuracy close to one (as measured by the Rand index) and improves computational efficiency by 6.88 to 10.25 times in template matching and by 730 to 6847 times in syslog parsing. It also improves the accuracy of failure prediction (as measured by the F1 measure) by 13.07% to 188%. In addition, we demonstrate Craftsman’s superior generality by comparing it with three widely applied log parsing methods over five large log datasets collected from servers, distributed systems, and applications.
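
For readers unfamiliar with the accuracy metric named above, the following is a minimal sketch of the standard pairwise definition of the Rand index. It is not taken from the Craftsman implementation: the function name `rand_index` and the toy labels are illustrative assumptions. Two groupings of the same syslogs agree on a pair when both put the pair in the same group or both put it in different groups, and the index is the fraction of pairs on which they agree.

```python
from itertools import combinations


def rand_index(labels_true, labels_pred):
    """Fraction of item pairs on which two groupings agree (hypothetical helper)."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("label lists must have the same length")
    pairs = list(combinations(range(len(labels_true)), 2))
    if not pairs:
        return 1.0  # fewer than two items: no pair to disagree on
    agree = 0
    for i, j in pairs:
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        # A pair counts as agreement if both groupings keep it together
        # or both groupings split it apart.
        if same_true == same_pred:
            agree += 1
    return agree / len(pairs)


# Toy data (assumed): manually assigned template IDs vs. parser output.
manual = ["T1", "T1", "T2", "T2", "T3"]
parsed = ["A", "A", "B", "C", "C"]
score = rand_index(manual, parsed)
```

A value of 1.0 means the parser's grouping of syslogs into templates matches the reference grouping exactly. This pairwise loop is quadratic in the number of syslogs, so for large datasets a contingency-table formulation would be preferable.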
