Log Clustering Based Problem Identification for Online Service Systems

Logs play an important role in the maintenance of large-scale online service systems. When an online service fails, engineers need to examine recorded logs to gain insights into the failure and identify the potential problems. Traditionally, engineers perform simple keyword search (such as “error” and “exception”) of logs that may be associated with the failures. Such an approach is often time consuming and error prone. Through our collaboration with Microsoft service product teams, we propose LogCluster, an approach that clusters the logs to ease log-based problem identification. LogCluster also utilizes a knowledge base to check if the log sequences occurred before. Engineers only need to examine a small number of previously unseen, representative log sequences extracted from the clusters to identify a problem, thus significantly reducing the number of logs that should be examined, meanwhile improving the identification accuracy. Through experiments on two Hadoop-based applications and two large-scale Microsoft online service systems, we show that our approach is effective and outperforms the state-of-the-art work proposed by Shang et al. in ICSE 2013. We have successfully applied LogCluster to the maintenance of many actual Microsoft online service systems. In this paper, we also share our success stories and lessons learned.

[1]  David Leon,et al.  Finding failures by cluster analysis of execution profiles , 2001, Proceedings of the 23rd International Conference on Software Engineering. ICSE 2001.

[2]  Gilbert Hamann,et al.  Abstracting Execution Logs to Execution Events for Enterprise Applications (Short Paper) , 2008, 2008 The Eighth International Conference on Quality Software.

[3]  Wei Xu,et al.  Experience Mining Google's Production Console Logs , 2010, SLAML.

[4]  Yuan Yu,et al.  Dryad: distributed data-parallel programs from sequential building blocks , 2007, EuroSys '07.

[5]  Qiang Fu,et al.  Where do developers log? an empirical study on logging practices in industry , 2014, ICSE Companion.

[6]  Qiang Fu,et al.  Healing online service systems via mining historical issue repositories , 2012, 2012 Proceedings of the 27th IEEE/ACM International Conference on Automated Software Engineering.

[7]  Ahmed E. Hassan,et al.  Studying the relationship between logging characteristics and the code quality of platform software , 2015, Empirical Software Engineering.

[8]  Michael I. Jordan,et al.  Detecting large-scale system problems by mining console logs , 2009, SOSP '09.

[9]  Ahmed E. Hassan,et al.  Understanding Log Lines Using Development Knowledge , 2014, 2014 IEEE International Conference on Software Maintenance and Evolution.

[10]  Thomas Reidemeister,et al.  Mining unstructured log files for recurrent fault diagnosis , 2011, 12th IFIP/IEEE International Symposium on Integrated Network Management (IM 2011) and Workshops.

[11]  Andrew S. Tanenbaum,et al.  Distributed systems: Principles and Paradigms , 2001 .

[12]  Qiang Fu,et al.  Execution Anomaly Detection in Distributed Systems through Unstructured Log Analysis , 2009, 2009 Ninth IEEE International Conference on Data Mining.

[13]  References , 1971 .

[14]  Christopher D. Manning,et al.  Introduction to Information Retrieval , 2010, J. Assoc. Inf. Sci. Technol..

[15]  Rongxin Wu,et al.  CrashLocator: locating crashing faults based on crash stacks , 2014, ISSTA 2014.

[16]  Jiawei Han,et al.  Classification of software behaviors for failure detection: a discriminative pattern mining approach , 2009, KDD.

[17]  J. Gower,et al.  Minimum Spanning Trees and Single Linkage Cluster Analysis , 1969 .

[18]  Ding Yuan,et al.  Characterizing logging practices in open-source software , 2012, 2012 34th International Conference on Software Engineering (ICSE).

[19]  Qiang Fu,et al.  Learning to Log: Helping Developers Make Informed Logging Decisions , 2015, 2015 IEEE/ACM 37th IEEE International Conference on Software Engineering.

[20]  Ding Yuan,et al.  Improving Software Diagnosability via Log Enhancement , 2012, TOCS.

[21]  Wei-Ying Ma,et al.  Automated known problem diagnosis with event traces , 2006, EuroSys.

[22]  Patrick Martin,et al.  Assisting developers of Big Data Analytics Applications when deploying on Hadoop clouds , 2013, 2013 35th International Conference on Software Engineering (ICSE).

[23]  Qiang Fu,et al.  Mining Historical Issue Repositories to Heal Large-Scale Online Service Systems , 2014, 2014 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks.

[24]  Qiang Fu,et al.  Mining Invariants from Console Logs for System Problem Detection , 2010, USENIX Annual Technical Conference.

[25]  Barton P. Miller,et al.  Problem Diagnosis in Large-Scale Computing Environments , 2006, ACM/IEEE SC 2006 Conference (SC'06).

[26]  Qiang Fu,et al.  Log2: A Cost-Aware Logging Mechanism for Performance Diagnosis , 2015, USENIX Annual Technical Conference.