Root cause detection in a service-oriented architecture

Large-scale websites are predominantly built as service-oriented architectures. Here, services are specialized for a certain task, run on multiple machines, and communicate with each other to serve a user's request. An anomalous change in a metric of one service can propagate to other services during this communication, degrading the request as a whole. Because any such degradation affects revenue, maintaining correct functionality is of paramount concern: the root cause of any anomaly must be found as quickly as possible. This is challenging because each service exposes numerous metrics or sensors, and a modern website is usually composed of hundreds of services running on thousands of machines in multiple data centers. This paper introduces MonitorRank, an algorithm that can reduce the time, domain knowledge, and human effort required to find the root causes of anomalies in such service-oriented architectures. In the event of an anomaly, MonitorRank provides a ranked list of possible root causes for monitoring teams to investigate. MonitorRank takes as input the historical and current time-series metrics of each sensor, along with the call graph generated between sensors, and builds an unsupervised model for ranking. Experiments on real production outage data from LinkedIn, one of the largest online social networks, show a 26% to 51% improvement in mean average precision in finding root causes compared to baseline and current state-of-the-art methods.
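
The abstract does not spell out the ranking procedure, so the following is only an illustrative sketch of how time-series metrics and a call graph could be combined into an unsupervised ranking. It is not the MonitorRank algorithm itself. It assumes a random walk over the call graph that prefers services whose metric correlates with the anomalous frontend metric. The names `pearson` and `rank_candidates`, the damping value, and the toy data are all hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length series (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def rank_candidates(call_graph, metrics, frontend, damping=0.85, iterations=100):
    """Rank services other than `frontend` as candidate root causes.

    call_graph: dict mapping a service to the list of services it calls.
    metrics:    dict mapping a service to its metric time series.
    Assumption: similarity to the frontend's anomalous metric biases both
    the walk along call edges and the restart (teleport) distribution.
    """
    nodes = sorted(metrics)
    others = [n for n in nodes if n != frontend]
    sim = {n: abs(pearson(metrics[n], metrics[frontend])) for n in nodes}

    # Restart distribution: proportional to similarity, never the frontend itself.
    total = sum(sim[n] for n in others)
    teleport = {n: (sim[n] / total if total > 0 else 1 / len(others)) for n in others}
    teleport[frontend] = 0.0

    score = {n: 1 / len(nodes) for n in nodes}
    for _ in range(iterations):
        nxt = {n: (1 - damping) * teleport[n] for n in nodes}
        for u in nodes:
            callees = call_graph.get(u, [])
            wsum = sum(sim[v] for v in callees)
            if wsum == 0:
                # No usable outgoing edge: redistribute via the restart distribution.
                for n in nodes:
                    nxt[n] += damping * score[u] * teleport[n]
            else:
                for v in callees:
                    nxt[v] += damping * score[u] * sim[v] / wsum
        score = nxt
    return sorted(others, key=lambda n: score[n], reverse=True)


# Toy data: the frontend calls a and b; a calls c. The latency spike appears
# in the frontend, a, and c, while b only oscillates.
call_graph = {"frontend": ["a", "b"], "a": ["c"], "b": [], "c": []}
metrics = {
    "frontend": [1, 1, 1, 1, 5, 5],
    "a": [1, 1, 1, 1, 5, 5],
    "b": [1, 2, 1, 2, 1, 2],
    "c": [1, 1, 1, 1, 5, 5],
}
ranking = rank_candidates(call_graph, metrics, "frontend")
```

In this toy setup the walk accumulates the most score at the deepest service that shares the anomalous pattern, which is the kind of ranked candidate list the abstract describes handing to a monitoring team.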
