IT troubleshooting with drift analysis in the DevOps era

Over the past few years, DevOps practices have reshaped the software industry. The need for agility has led to continuous development and the frequent deployment of small updates to IT production systems. These ever-changing applications and their operating environments challenge existing IT troubleshooting approaches, which generally depend on prebuilt domain knowledge and do not account for frequent change. The complexity and diversity of application architectures exacerbate the problem. In this paper, we propose CHASER, an unsupervised-learning-based drift analysis tool that detects and analyzes abnormal changes (referred to as “drifts,” which include configuration errors, hanging processes, etc.). CHASER applies learned change models and patterns both to real-time detection and to root cause analysis. First, we categorize changes into two groups, static state changes and dynamic state changes, and periodically collect them at a fine granularity. Then, we extract time-series and structural features from these changes and apply statistical and machine learning algorithms to learn models and patterns from historical data. Finally, we apply these models and patterns to detect drifts in real time and to infer possible root causes of reported errors, using a multidimensional correlation approach to improve precision. Experiments and case studies demonstrate the capability of CHASER.
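The learn-then-detect idea described above can be illustrated with a minimal sketch. It is not CHASER's actual algorithm, which the abstract does not specify. As a stand-in it assumes the simplest possible change model, the mean and standard deviation of per-interval change counts, and flags an interval as a drift when its z-score exceeds a threshold. The function names, the sample data, and the threshold of 3.0 are all hypothetical.

```python
from statistics import mean, stdev

def learn_baseline(history):
    """Learn a toy change model from historical per-interval change counts.

    Assumption: the model is just (mean, sample standard deviation).
    """
    return mean(history), stdev(history)

def detect_drifts(baseline, observed, threshold=3.0):
    """Return (index, count, z_score) for intervals that deviate from the baseline."""
    mu, sigma = baseline
    drifts = []
    for i, count in enumerate(observed):
        if sigma > 0:
            z = (count - mu) / sigma
        else:
            # Zero variance in the history: any deviation at all is anomalous.
            z = 0.0 if count == mu else float("inf")
        if abs(z) > threshold:
            drifts.append((i, count, z))
    return drifts

# Hypothetical data: number of configuration changes seen per collection interval.
history = [4, 5, 6, 5, 4, 6, 5, 5]
baseline = learn_baseline(history)

observed = [5, 6, 40, 4]
drifts = detect_drifts(baseline, observed)
for index, count, z in drifts:
    print(f"interval {index}: {count} changes (z = {z:.1f})")
```

Here only the third observed interval, with 40 changes, is flagged. A real system would replace the z-score with richer time-series and structural features, but the split between learning from history and checking new observations stays the same.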
