Bad Words: Finding Faults in Spirit's Syslogs

Accurate fault detection is a key element of resilient computing. Syslogs carry essential information about faults and are found on nearly all computing systems. Discovering new fault types still requires expert human effort, however, because no previous algorithm has been shown to localize faults in time and space with an operationally acceptable false positive rate. We present experiments on three weeks of syslogs from Sandia's 512-node "Spirit" Linux cluster, showing one algorithm that localizes 50% of faults with 75% precision, corresponding to an excellent false positive rate of 0.05%. The salient characteristics of this algorithm are (1) calculation of nodewise information entropy and (2) encoding of word position. The key observation is that similar computers correctly executing similar work should produce similar logs.
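
The two characteristics named above can be illustrated with a minimal sketch. The abstract does not give the exact formulas, so everything here is an assumption: the weighting is the standard log-entropy term weight from information retrieval, tokens are whitespace-separated words prefixed with their position, and the names `position_tokens`, `nodewise_entropy_weights`, and `logs_by_node` are hypothetical. A term that appears evenly on every node gets a weight near 0, while a term concentrated on a single node gets a weight of 1, which reflects the observation that similar computers doing similar work should produce similar logs.

```python
import math
from collections import Counter, defaultdict


def position_tokens(message):
    """Encode word position (assumed scheme): 'kernel panic' -> ['0:kernel', '1:panic']."""
    return [f"{i}:{word}" for i, word in enumerate(message.split())]


def nodewise_entropy_weights(logs_by_node):
    """Return a weight in [0, 1] for every position-encoded term.

    logs_by_node maps a node name to the list of message strings it logged.
    Assumed weighting: 1 - H(term) / log(N), where H is the entropy of the
    term's distribution across the N nodes (standard log-entropy weighting).
    """
    # term -> Counter mapping node -> number of occurrences on that node
    counts = defaultdict(Counter)
    for node, messages in logs_by_node.items():
        for message in messages:
            for term in position_tokens(message):
                counts[term][node] += 1

    n_nodes = len(logs_by_node)
    weights = {}
    for term, per_node in counts.items():
        total = sum(per_node.values())
        entropy = -sum((c / total) * math.log(c / total) for c in per_node.values())
        # With a single node there is nothing to compare against; treat as weight 1.
        weights[term] = 1.0 - entropy / math.log(n_nodes) if n_nodes > 1 else 1.0
    return weights


if __name__ == "__main__":
    # Made-up toy messages, not taken from the Spirit logs.
    logs = {
        "n1": ["job start ok"],
        "n2": ["job start ok"],
        "n3": ["job start ok", "kernel panic now"],
    }
    for term, weight in sorted(nodewise_entropy_weights(logs).items()):
        print(f"{term}\t{weight:.3f}")
```

In this toy input the terms from "job start ok" occur equally on all three nodes and so carry almost no weight, whereas the terms seen only on `n3` carry full weight. Encoding position keeps the same word in different message slots distinct, at the cost of a larger vocabulary.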
