Communication-efficient Outlier Detection for Scale-out Systems

Modern scale-out services are built on top of large datacenters composed of thousands of individual machines. These machines must be continuously monitored because unexpected failures can overload failover mechanisms and cause large-scale outages. Such monitoring can be accomplished by periodically measuring hundreds of performance metrics and looking for outliers, which are often caused by misconfigurations, hardware failures, or software bugs. Previous work has shown that many failures are indeed preceded by such performance outliers, known as performance problems or latent faults. In this work we adapt an existing unsupervised statistical framework for latent fault detection to provide an online version with reduced communication and computation. The existing framework is effective in predicting machine failures days before they happen, but it requires each monitored machine to send all of its periodic metric measurements, which is prohibitive in some settings and requires the datacenter to provide parallel storage and processing. Our adapted framework reduces both the amount of data sent and the processing cost at the central coordinator by processing the data in situ, making it usable in a wider range of settings. We use techniques from the domain of stream processing, specifically sketching and safe zones, to trade off accuracy for communication and computation without compromising the framework's advantages. Like the original framework, our adapted framework is unsupervised, does not require domain knowledge, and provides statistical guarantees on the rate of false positives. Initial experiments show that the scores yielded by the adapted framework match the original scores very well, while communication is reduced by over 90%.
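The combination of sketching and local filtering mentioned above can be illustrated with a minimal sketch. It is not the framework's actual algorithm: the Gaussian random projection, the ball-shaped region around the last reported sketch (a simplified stand-in for a safe zone), and all names (`make_projection`, `sketch`, `Machine`, `radius`) are assumptions chosen for illustration only.

```python
import math
import random


def make_projection(in_dim, out_dim, seed=0):
    """Build a random Gaussian projection matrix (assumed sketching scheme)."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(out_dim)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(in_dim)]
            for _ in range(out_dim)]


def sketch(matrix, vector):
    """Project a long metric vector onto a shorter sketch vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


class Machine:
    """Hypothetical monitored machine that reports only when its sketch drifts.

    The ball of the given radius around the last reported sketch is a
    simplified stand-in for a safe zone, not the paper's construction.
    """

    def __init__(self, matrix, radius):
        self.matrix = matrix
        self.radius = radius
        self.last_sent = None
        self.messages = 0

    def observe(self, metrics):
        """Return the sketch to send to the coordinator, or None to stay silent."""
        current = sketch(self.matrix, metrics)
        if self.last_sent is None or distance(current, self.last_sent) > self.radius:
            self.last_sent = current
            self.messages += 1
            return current
        return None
```

In this toy setup a machine sends its first sketch and then stays silent while its measurements remain close to the last reported value, so the coordinator receives fewer and shorter messages at the cost of some accuracy.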
