Localizing Faults in Cloud Systems

By leveraging large clusters of commodity hardware, the Cloud offers great opportunities to optimize the operating costs of software systems, but it also significantly affects the reliability of software applications. The limited control that applications have over Cloud execution environments largely limits the applicability of state-of-the-art approaches that address reliability issues through heavyweight training with injected faults. In this paper, we propose LOUD, a lightweight fault localization approach that relies on positive training only and can thus operate within the constraints of Cloud systems. LOUD combines machine learning and graph theory: it trains machine learning models with correct executions only, and compensates for the inaccuracy that derives from training with positive samples alone by processing the output of the models with graph theory algorithms. The experimental results reported in this paper confirm that LOUD can localize faults with high precision while relying only on lightweight positive training.
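The two-stage idea described above (models trained on correct executions only, followed by graph analysis of what they flag) can be illustrated with a minimal sketch. Everything concrete in it is an assumption made for illustration and is not taken from the abstract: the per-KPI z-score baseline stands in for the machine learning models, the KPI names and the hand-written `influence_edges` are hypothetical, and PageRank is just one possible graph algorithm for ranking suspects.

```python
from statistics import mean, stdev

def fit_baselines(normal_runs):
    """Learn per-KPI mean and standard deviation from fault-free samples only."""
    return {kpi: (mean(vals), stdev(vals)) for kpi, vals in normal_runs.items()}

def anomalous_kpis(baselines, observed, threshold=3.0):
    """Flag KPIs whose observed value lies more than `threshold` sigmas from the baseline."""
    flagged = set()
    for kpi, value in observed.items():
        mu, sigma = baselines[kpi]
        if sigma > 0 and abs(value - mu) / sigma > threshold:
            flagged.add(kpi)
    return flagged

def pagerank(edges, nodes, damping=0.85, iterations=50):
    """Plain power-iteration PageRank over a directed graph given as (src, dst) pairs."""
    rank = {n: 1.0 / len(nodes) for n in nodes}
    out = {n: [dst for src, dst in edges if src == n] for n in nodes}
    for _ in range(iterations):
        new = {n: (1 - damping) / len(nodes) for n in nodes}
        for n in nodes:
            targets = out[n] or nodes  # nodes without out-edges spread rank evenly
            share = damping * rank[n] / len(targets)
            for t in targets:
                new[t] += share
        rank = new
    return rank

def localize(baselines, observed, influence_edges):
    """Rank anomalous KPIs so that likely root causes come first."""
    flagged = sorted(anomalous_kpis(baselines, observed))
    # Keep only influence relations between anomalous KPIs (assumed given here).
    edges = [(s, d) for s, d in influence_edges if s in flagged and d in flagged]
    # Reverse the edges so that rank flows from symptoms back toward causes.
    reversed_edges = [(d, s) for s, d in edges]
    rank = pagerank(reversed_edges, flagged)
    return sorted(flagged, key=lambda k: rank[k], reverse=True)

# Hypothetical fault-free training data: a few samples per KPI.
normal_runs = {
    "db.cpu": [20, 22, 21, 19, 20],
    "app.latency": [100, 105, 98, 102, 101],
    "web.errors": [1, 0, 2, 1, 1],
    "web.mem": [50, 51, 49, 50, 52],
}
# Hypothetical observation taken during a failure.
observed = {"db.cpu": 95, "app.latency": 400, "web.errors": 30, "web.mem": 50}
# Hypothetical "X influences Y" relations between KPIs.
influence_edges = [
    ("db.cpu", "app.latency"),
    ("app.latency", "web.errors"),
    ("db.cpu", "web.mem"),
]

ranking = localize(fit_baselines(normal_runs), observed, influence_edges)
print(ranking)
```

In this toy setup the detector never sees a faulty execution during training, so it can only report which KPIs look abnormal; the graph step is what orders them, placing the KPI that the other anomalies trace back to at the top of the ranking.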
