Predictive and Programmable Testing of Concurrent and Cloud Systems

Today's software systems often have poor reliability. In addition to causing billions in losses, software defects have been responsible for serious injuries and deaths in transportation accidents, medical treatments, and defense operations. The situation is getting worse as concurrency and distributed computing become integral parts of many real-world software systems. The non-determinism in concurrent and distributed systems, and the unreliability of the hardware environment in which they operate, can result in defects that are hard to find and understand. In this thesis, we have developed tools and techniques that augment testing so that it can quickly find and reproduce important bugs in concurrent and distributed systems. Our techniques are based on two key ideas: (i) use program analysis to increase coverage by predicting bugs that could have occurred in "nearby" program executions, and (ii) provide programming abstractions that let testers easily express their insights, guiding testing towards those executions that are more likely to exhibit bugs or to help achieve testing objectives, without requiring any knowledge of the underlying testing process. The tools that we have built have found many serious bugs in large real-world software systems (e.g., the Jigsaw web server, JDK, JGroups, and the Hadoop File System).
In the first part of the thesis, we describe how we can predict and confirm bugs in concurrent systems that did not show up during testing but that could have shown up had the program under consideration executed under a different thread schedule. This improves the coverage of testing and helps find corner-case bugs that are unlikely to be discovered during traditional testing. We have built predictive testing tools to find different classes of serious bugs, such as deadlocks, hangs, and typestate errors, in concurrent systems.
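As an illustration of predicting a bug from a "nearby" execution, the following is a minimal sketch of a lock-order analysis over a single observed trace. It is not the thesis's implementation: the trace format, the function names, and the restriction to two-lock cycles and non-reentrant locks are all assumptions made for this example.

```python
from collections import defaultdict

def lock_order_edges(trace):
    """Collect (held_lock, acquired_lock, thread) edges from an observed trace.

    `trace` is a hypothetical format: a list of (thread, op, lock) tuples,
    where op is "acquire" or "release". Locks are assumed non-reentrant.
    """
    held = defaultdict(list)   # thread -> locks currently held, in order
    edges = set()
    for thread, op, lock in trace:
        if op == "acquire":
            for h in held[thread]:
                edges.add((h, lock, thread))
            held[thread].append(lock)
        elif op == "release":
            held[thread].remove(lock)
    return edges

def potential_deadlocks(trace):
    """Report lock pairs that two different threads acquire in opposite orders.

    Only cycles of length two are checked; longer cycles are out of scope
    for this sketch.
    """
    edges = lock_order_edges(trace)
    reports = set()
    for a, b, t1 in edges:
        for c, d, t2 in edges:
            if t1 != t2 and a == d and b == c:
                reports.add(frozenset((a, b)))
    return reports

# An execution that completed without deadlocking: T1 ran to completion
# before T2 started, yet the two threads nest locks A and B in opposite orders.
observed = [
    ("T1", "acquire", "A"), ("T1", "acquire", "B"),
    ("T1", "release", "B"), ("T1", "release", "A"),
    ("T2", "acquire", "B"), ("T2", "acquire", "A"),
    ("T2", "release", "A"), ("T2", "release", "B"),
]
reports = potential_deadlocks(observed)
```

A report from such an analysis is only a prediction. It still has to be confirmed, for example by re-running the program under a schedule that tries to interleave the two conflicting acquisitions.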
In the second part of the thesis, we investigate how to improve the efficiency of testing distributed cloud systems by letting testers guide testing towards the executions that are interesting to them. For example, a tester might want to test those executions that are more likely to be erroneous or that are more likely to help her achieve her testing objectives. We have built tools and frameworks that enable testers to easily express their knowledge and intuition to guide testing, without requiring any knowledge of the underlying testing process. We have investigated programmable testing tools in the context of testing large-scale distributed systems.
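One way to picture such a programmable interface is as a tester-written predicate that filters the failure scenarios a test driver would otherwise enumerate exhaustively. The sketch below is purely illustrative: the failure-point representation, the function names, and the example policy are assumptions for this example and do not reproduce the API of any tool described in the thesis.

```python
from itertools import combinations

def candidate_scenarios(failure_points, max_failures):
    """Enumerate every combination of up to `max_failures` failure points."""
    for k in range(1, max_failures + 1):
        yield from combinations(failure_points, k)

def select(scenarios, policy):
    """Keep only the scenarios accepted by the tester's policy."""
    return [s for s in scenarios if policy(s)]

# Hypothetical failure points: (node, operation) pairs where a fault could be injected.
points = [
    ("node1", "write"), ("node1", "read"),
    ("node2", "write"), ("node2", "read"),
]

def write_failures_on_distinct_nodes(scenario):
    """Example policy: inject only write failures, at most one per node."""
    nodes = [node for node, _ in scenario]
    only_writes = all(op == "write" for _, op in scenario)
    return only_writes and len(set(nodes)) == len(nodes)

all_scenarios = list(candidate_scenarios(points, max_failures=2))
selected = select(all_scenarios, write_failures_on_distinct_nodes)
```

The policy is written purely in terms of failure points. It does not need to know how the driver enumerates or executes scenarios, which is the separation of concerns the paragraph above describes.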
