How bad can a bug get? An empirical analysis of software failures in the OpenStack cloud computing platform

Cloud management systems provide abstractions and APIs for programmatically configuring cloud infrastructures. Unfortunately, residual software bugs in these systems can lead to high-severity failures, such as prolonged outages and data losses. In this paper, we investigate the impact of failures in the context of the widespread OpenStack cloud management system by performing fault injection and analyzing the resulting failures in terms of fail-stop behavior, failure detection through logging, and failure propagation across components. The analysis shows that most of the failures are not detected and reported in a timely manner; moreover, many of these failures can silently propagate over time and across components of the cloud management system, which calls for more thorough run-time checks and fault containment.
