Fault detection and localization in distributed systems using invariant relationships

Recent advances in sensing and communication technologies enable us to collect round-the-clock monitoring data from a wide array of distributed systems, including data centers, manufacturing plants, transportation networks, and automobiles. This data often takes the form of time series collected from multiple hardware and software sensors. Previously, we developed an approach based on time-invariant relationships that uses AutoRegressive models with eXogenous input (ARX) to model this data. A tool based on our approach has been effective for fault detection and capacity planning in distributed systems. In this paper, we first describe our experience in applying this tool in real-world settings. We then discuss the challenges we face in fault localization when using our tool, and present two approaches that we developed to address this problem: a spatial approach based on invariant graphs and a temporal approach based on expected broken invariant patterns.
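To make the idea of an ARX-based invariant concrete, the following is a minimal sketch rather than the tool described above. It fits an ARX model between two sensor time series by ordinary least squares, scores the fit, and flags time steps at which the learned relationship appears broken. The function names, the normalized fitness score, the model orders, and the residual threshold (three times the largest training residual) are all illustrative assumptions; none of them is taken from this abstract.

```python
import numpy as np

def _design(x, y, n, m):
    """Build the regression matrix for y[t] = sum a_i*y[t-i] + sum b_j*x[t-j] + c."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    start = max(n, m)
    rows = []
    for t in range(start, len(y)):
        past_y = [y[t - i] for i in range(1, n + 1)]   # autoregressive terms
        past_x = [x[t - j] for j in range(0, m + 1)]   # exogenous input terms
        rows.append(past_y + past_x + [1.0])           # constant offset
    return np.array(rows), y[start:]

def fit_arx(x, y, n=1, m=1):
    """Least-squares estimate of the ARX coefficients [a_1..a_n, b_0..b_m, c]."""
    design, target = _design(x, y, n, m)
    theta, *_ = np.linalg.lstsq(design, target, rcond=None)
    return theta

def residuals(theta, x, y, n=1, m=1):
    """Absolute one-step-ahead prediction errors, using observed past values."""
    design, target = _design(x, y, n, m)
    return np.abs(target - design @ theta)

def fitness(theta, x, y, n=1, m=1):
    """Assumed score: 1 - ||y - y_hat|| / ||y - mean(y)||; values near 1 mean a good fit."""
    design, target = _design(x, y, n, m)
    err = np.linalg.norm(target - design @ theta)
    return 1.0 - err / np.linalg.norm(target - target.mean())

# Synthetic example (hypothetical data): y follows x through a fixed ARX relation.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=300)
y = np.zeros(300)
for t in range(1, 300):
    y[t] = 0.5 * y[t - 1] + 2.0 * x[t] + 1.0 + rng.normal(0.0, 0.01)

theta = fit_arx(x, y)
score = fitness(theta, x, y)

# Assumed threshold rule: three times the largest residual seen on normal data.
threshold = 3.0 * residuals(theta, x, y).max()

# Inject a fault: the second sensor is offset from time step 150 onward.
y_faulty = y.copy()
y_faulty[150:] += 5.0
broken = residuals(theta, x, y_faulty) > threshold
```

In this sketch, `broken[k]` refers to time step `k + max(n, m)`, because the first `max(n, m)` samples are consumed as history. A pair of series whose `score` stays high on normal data would be treated as an invariant, and a sustained run of `True` values in `broken` would mark that invariant as violated.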
