Analysis and Diagnosis of SLA Violations in a Production SaaS Cloud

A software-as-a-service (SaaS) platform must deliver its intended service in accordance with its stated service-level agreements (SLAs). Although SLA violations in SaaS platforms have been reported, little work has been done to empirically characterize SaaS failures. In this paper, we study SLA violations of a production SaaS platform, diagnose their causes, uncover several critical failure modes, and suggest solution approaches to increase the availability of the platform as perceived by the end user. Our approach combines field failure data analysis (FFDA) and fault injection. The study is based on 283 days of operational logs of the platform. During this time, the platform received business workload from 42 customers spread over 22 countries. We first developed a set of home-grown FFDA tools to analyze the logs, and then implemented a fault injector that automatically injects several runtime errors into the application code, written in .NET/C#, and collates the injection results. We summarize our findings as follows: first, system failures caused 93% of all SLA violations; second, our fault injector was able to recreate a few cases of bursts of SLA violations that could not be diagnosed from the logs; and third, the fault injection mechanism could recreate several error propagation paths leading to data corruptions that the failure data analysis could not reveal. Finally, the paper presents some system-level implications of this study and discusses how the joint use of fault injection and log analysis may help improve the reliability of the measured platform.
