DScope: Detecting Real-World Data Corruption Hang Bugs in Cloud Server Systems

Cloud server systems such as Hadoop and Cassandra have enabled many real-world data-intensive applications to run inside computing clouds. However, those systems suffer from many data corruption and performance problems that are notoriously difficult to debug because little diagnostic information is available. In this paper, we present DScope, a tool that statically detects software hang bugs related to data corruption in cloud server systems. DScope statically analyzes the I/O operations and loops in a software package and identifies loops whose exit conditions can be affected by I/O operations through returned data, returned error codes, or I/O exception handling. After identifying the loops that are prone to hanging under data corruption, DScope performs loop bound and loop stride analysis to prune false positives. We have implemented DScope and evaluated it on 9 common cloud server systems. Our results show that DScope detects 42 real software hang bugs, 29 of which are newly discovered. In contrast, existing bug detection tools miss most of those bugs.
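To make the targeted bug class concrete, the following minimal Java sketch shows a loop whose exit condition depends on the value returned by an I/O call. It is an illustration only: it is not taken from DScope or from any of the evaluated systems, and the names `SkipLoopSketch`, `hangProneSkip`, `skipFully`, and `maxIterations` are hypothetical. The sketch assumes a length field that claims more bytes than the stream holds, as could happen with corrupted or truncated data. The iteration cap exists only so that the demonstration terminates; without it the first loop would spin forever.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

class SkipLoopSketch {

    // Hang-prone pattern: the loop exits only when skip() makes progress.
    // At end of stream skip() may return 0, so 'remaining' never shrinks.
    // The cap (an assumption added for this demo) stands in for the hang.
    static boolean hangProneSkip(InputStream in, long n, int maxIterations)
            throws IOException {
        long remaining = n;
        int iterations = 0;
        while (remaining > 0) {
            if (++iterations > maxIterations) {
                return false; // the uncapped loop would never terminate
            }
            remaining -= in.skip(remaining);
        }
        return true;
    }

    // One possible repair: when skip() makes no progress, probe with read()
    // and fail with an exception at end of stream instead of spinning.
    static void skipFully(InputStream in, long n) throws IOException {
        long remaining = n;
        while (remaining > 0) {
            long skipped = in.skip(remaining);
            if (skipped <= 0) {
                if (in.read() == -1) {
                    throw new EOFException(
                        "stream ended with " + remaining + " bytes left to skip");
                }
                skipped = 1; // read() consumed exactly one byte
            }
            remaining -= skipped;
        }
    }

    public static void main(String[] args) throws IOException {
        // Only 4 bytes are available, but the (corrupted) length claims 10.
        byte[] data = new byte[4];
        boolean finished =
            hangProneSkip(new ByteArrayInputStream(data), 10, 1000);
        System.out.println("hang-prone loop finished: " + finished);
        try {
            skipFully(new ByteArrayInputStream(data), 10);
        } catch (EOFException e) {
            System.out.println("guarded loop failed fast: " + e.getMessage());
        }
    }
}
```

In the terms used by the abstract, the first loop's stride is the return value of an I/O operation, which can become zero under corruption. The second loop turns that case into an exception, so every iteration either makes progress or exits.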
