Understanding Exception-Related Bugs in Large-Scale Cloud Systems

Exception mechanisms are widely used in cloud systems, mainly because they separate error-handling code from the main business logic. However, the huge space of potential error conditions and the sophisticated logic of cloud systems make it hard to use exceptions correctly, and mistakes in exception use can lead to severe consequences such as system downtime and data loss. To address this issue, the community urgently needs a better understanding of exception-related bugs (eBugs), i.e., bugs caused by incorrect use of the exception mechanism, in cloud systems. In this paper, we present a comprehensive study of 210 eBugs from six widely deployed cloud systems: Cassandra, HBase, HDFS, Hadoop MapReduce, YARN, and ZooKeeper. For all the studied eBugs, we analyze their triggering conditions, root causes, and impacts, as well as the relations among them. To the best of our knowledge, this is the first study of eBugs in cloud systems and the first to focus on triggering conditions. We find that eBugs are severe in cloud systems: 74% of the studied eBugs affect system availability or integrity. Fortunately, exposing eBugs through testing is feasible: 54% of the eBugs are triggered by non-semantic conditions, such as network errors, and 40% can be triggered by simulating the triggering conditions at simple system states. Furthermore, we find that triggering conditions are useful for detecting eBugs. Based on these findings, we build a static analysis tool called DIET and apply it to the latest versions of the studied systems. DIET reports 31 bugs and bad practices, 23 of which the developers have confirmed as previously unknown.
