Online Behavior Identification in Distributed Systems

Diagnosing, predicting, and understanding unexpected behavior is crucial for long-running, large-scale distributed systems. However, existing work either focuses on identifying faults at specific points in time that are preceded by significantly abnormal metric readings, or requires prior analysis of historical failure data. In this work, we propose an online behavior classification system that identifies a wide range of undesired behaviors, which may appear even in healthy systems, and tracks their evolution over time. We employ a two-step process in which two online classifiers operate on periodically collected system metrics to identify, at runtime, normal and anomalous behaviors such as deadlock, starvation, and livelock, without any prior analysis of historical failure data. Our approach achieves over 80% accuracy in detecting unexpected behaviors and over 90% accuracy in identifying their type, with a short delay after the anomalies appear and with minimal expert intervention. Our experimental analysis uses system execution traces obtained from a Google cluster and from our in-house distributed system with varied behaviors, and shows the benefits of our approach as well as challenges for future research.
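
To make the two-step structure concrete, the following is a minimal sketch and not the classifiers used in this work: it assumes a simple multiclass perceptron as a stand-in for each online classifier, and the names `OnlinePerceptron`, `TwoStepClassifier`, and the label strings are hypothetical. The first classifier separates normal from anomalous metric samples; the second is consulted only for anomalous samples and names the behavior type. Both are updated one sample at a time, so no historical failure data is needed up front.

```python
class OnlinePerceptron:
    """Multiclass perceptron updated one sample at a time.

    Illustrative stand-in for an online classifier; the choice of
    perceptron is an assumption, not the method of the paper.
    """

    def __init__(self, n_features, labels):
        # One weight vector per label; the last entry is the bias term.
        self.weights = {label: [0.0] * (n_features + 1) for label in labels}

    def _score(self, label, x):
        w = self.weights[label]
        return w[-1] + sum(wi * xi for wi, xi in zip(w, x))

    def predict(self, x):
        # Ties are broken in favor of the label listed first.
        return max(self.weights, key=lambda label: self._score(label, x))

    def update(self, x, true_label):
        """Predict, then correct the weights only if the prediction was wrong."""
        predicted = self.predict(x)
        if predicted != true_label:
            for i, xi in enumerate(x):
                self.weights[true_label][i] += xi
                self.weights[predicted][i] -= xi
            self.weights[true_label][-1] += 1.0
            self.weights[predicted][-1] -= 1.0
        return predicted


class TwoStepClassifier:
    """Step 1 detects anomalous samples; step 2 identifies their type."""

    def __init__(self, n_features, anomaly_types):
        self.detector = OnlinePerceptron(n_features, ["normal", "anomalous"])
        self.identifier = OnlinePerceptron(n_features, anomaly_types)

    def classify(self, x):
        if self.detector.predict(x) == "normal":
            return "normal"
        return self.identifier.predict(x)

    def learn(self, x, label):
        # The detector sees every sample; the identifier only anomalous ones.
        is_normal = label == "normal"
        self.detector.update(x, "normal" if is_normal else "anomalous")
        if not is_normal:
            self.identifier.update(x, label)
```

In use, each periodically collected metric vector would be passed to `classify`, and to `learn` whenever a label becomes available (for example, from occasional expert feedback). The feature layout and the source of labels are left open here and would depend on the monitored system.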
