FluxRank: A Widely-Deployable Framework to Automatically Localizing Root Cause Machines for Software Service Failure Mitigation

Failures of software services directly affect user experience and service revenue. Operators therefore monitor both service-level KPIs (e.g., response time) and machine-level KPIs (e.g., CPU usage) on each machine underlying the service. When a service fails, operators must localize the root cause machines and mitigate the failure as quickly as possible. Existing approaches have limited applicability because the additional measurement data they require is difficult to obtain. As a result, failure localization is largely manual and very time-consuming. This paper presents FluxRank, a widely-deployable framework that automatically and accurately localizes the root cause machines so that actions can be triggered to mitigate the service failure. Our evaluation using historical cases from five real services (with tens of thousands of machines) of a top search company shows that the root cause machines are ranked top 1 (top 3) in 55 (66) of 70 cases. Compared with existing approaches, FluxRank cuts the localization time by more than 80% on average. FluxRank has been deployed online at one Internet service and six banking services for three months, and correctly localized the root cause machines as the top 1 in 55 of 59 cases.
