Torturing Databases for Fun and Profit

Programmers use databases when they want a high level of reliability. Specifically, they want the sophisticated ACID (atomicity, consistency, isolation, and durability) protection modern databases provide. However, the ACID properties are far from trivial to provide, particularly when high performance must be achieved. This leads to complex and error-prone code--even at a low defect rate of one bug per thousand lines, the millions of lines of code in a commercial OLTP database can harbor thousands of bugs. Here we propose a method to expose and diagnose violations of the ACID properties. We focus on an ostensibly easy case: power faults. Our framework includes workloads to exercise the ACID guarantees, a record/replay subsystem to allow the controlled injection of simulated power faults, a ranking algorithm to prioritize where to fault based on our experience, and a multi-layer tracer to diagnose root causes. Using our framework, we study 8 widely-used databases, ranging from open-source key-value stores to high-end commercial OLTP servers. Surprisingly, all 8 databases exhibit erroneous behavior. For the open-source databases, we are able to diagnose the root causes using our tracer, and for the proprietary commercial databases we can reproducibly induce data loss.

[1]  Lara Dolecek,et al.  Tackling intracell variability in TLC Flash through tensor product codes , 2012, 2012 IEEE International Symposium on Information Theory Proceedings.

[2]  Terence Kelly,et al.  Failure-atomic msync(): a simple and efficient mechanism for preserving the integrity of durable data , 2013, EuroSys '13.

[3]  John R. Douceur,et al.  Cycles, cells and platters: an empirical analysisof hardware failures on a million consumer PCs , 2011, EuroSys '11.

[4]  Steven Swanson,et al.  Understanding the impact of power loss on flash memory , 2011, 2011 48th ACM/EDAC/IEEE Design Automation Conference (DAC).

[5]  Andrea C. Arpaci-Dusseau,et al.  All File Systems Are Not Created Equal: On the Complexity of Crafting Crash-Consistent Applications , 2014, OSDI.

[6]  Elaine J. Weyuker,et al.  An AGENDA for testing relational database applications , 2004, Softw. Test. Verification Reliab..

[7]  Peter M. Chen,et al.  The Rio file cache: surviving operating system crashes , 1996, ASPLOS VII.

[8]  Donald R. Slutz,et al.  Massive Stochastic Testing of SQL , 1998, VLDB.

[9]  Junfeng Yang,et al.  EXPLODE: a lightweight, general system for finding serious storage system errors , 2006, OSDI '06.

[10]  Paul H. Siegel,et al.  Characterizing flash memory: Anomalies, observations, and applications , 2009, 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[11]  Haoxiang Lin,et al.  MODIST: Transparent Model Checking of Unmodified Distributed Systems , 2009, NSDI.

[12]  Andrea C. Arpaci-Dusseau,et al.  Consistency without ordering , 2012, FAST.

[13]  Garth A. Gibson,et al.  RAID: high-performance, reliable secondary storage , 1994, CSUR.

[14]  Bianca Schroeder,et al.  Disk Failures in the Real World: What Does an MTTF of 1, 000, 000 Hours Mean to You? , 2007, FAST.

[15]  Jeffrey F. Naughton,et al.  Impact of disk corruption on open-source DBMS , 2010, 2010 IEEE 26th International Conference on Data Engineering (ICDE 2010).

[16]  A. Zeller Isolating cause-effect chains from computer programs , 2002, SIGSOFT '02/FSE-10.

[17]  Gustavo Alonso,et al.  RapiLog: reducing system complexity through verification , 2013, EuroSys '13.

[18]  Andrea C. Arpaci-Dusseau,et al.  IRON file systems , 2005, SOSP '05.

[19]  Leo Giakoumakis,et al.  A genetic approach for random testing of database systems , 2007, VLDB.

[20]  Mark Lillibridge,et al.  Understanding the robustness of SSDS under power fault , 2013, FAST.