Deterministic replay using processor support and its applications

The processor industry is at an inflection point. In the past, performance was its driving force. In the coming many-core era, however, improving the programmability and reliability of systems will be at least as important as improving raw performance. Toward this vision, this thesis presents a processor feature that helps programmers understand software failures. Reproducing software failures is a significant challenge, and the problem is especially severe for multi-threaded programs because the causes of failure can be non-deterministic. The proposed processor feature continuously logs a program's execution while sacrificing very little performance (1%). If the program crashes, the developer can use the log to debug the failure by deterministically replaying every instruction executed as part of the failed execution. Two key mechanisms enable this deterministic replay feature. One is BugNet, a checkpointing technique that logs all of the non-deterministic input to a thread by logging the values of load instructions. The other is Strata, a logging primitive for recording shared-memory dependencies in a snoop-based or directory-based shared-memory multi-processor. The former is sufficient for uni-processor systems; the latter is required for multi-processor systems. As a proof of concept, this thesis presents a software implementation of the BugNet replayer built using the Pin instrumentation tool. To understand the space requirements of the BugNet recorder for debugging, this thesis empirically quantifies how much of a program's execution needs to be logged and replayed in order to understand the root cause of a majority of bugs. Finally, to demonstrate the utility of the deterministic replay feature, this thesis presents a software tool, built using a deterministic replayer, that finds data race bugs in shared-memory multi-threaded programs and automatically prioritizes them. The data race detection tool was built in collaboration with Microsoft. It has been used to find and fix data race bugs in production code, including Windows Vista and Internet Explorer.
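The load-logging idea attributed to BugNet above can be illustrated with a toy sketch. This is not the thesis's hardware design: the accumulator machine, the `PROGRAM` encoding, and the `record` and `replay` helpers are all hypothetical names invented for illustration. The sketch also logs every load value, a simplification that ignores any log-size optimizations a real recorder would apply. It only shows why the values returned by loads are enough to re-execute a thread deterministically, even when other threads change memory unpredictably.

```python
import random

# Hypothetical toy program for a one-accumulator machine (illustration only).
PROGRAM = [("load", 0), ("add", 5), ("store", 1),
           ("load", 1), ("add", 1), ("load", 2)]


def execute(program, load, store):
    """Run the program, routing every memory access through load/store."""
    acc = 0
    trace = []
    for op, arg in program:
        if op == "load":
            acc = load(arg)
        elif op == "add":
            acc += arg
        elif op == "store":
            store(arg, acc)
        else:
            raise ValueError(f"unknown op: {op}")
        trace.append(acc)  # accumulator value after each instruction
    return trace


def record(program, memory, rng):
    """Execute against real memory and log the value of every load."""
    log = []

    def load(addr):
        # Assumption: model a racing thread by randomly overwriting memory.
        if rng.random() < 0.5:
            memory[addr] = rng.randrange(100)
        value = memory[addr]
        log.append(value)
        return value

    def store(addr, value):
        memory[addr] = value

    return execute(program, load, store), log


def replay(program, log):
    """Re-execute using only the log; memory is never consulted."""
    values = iter(log)

    def load(addr):
        return next(values)

    def store(addr, value):
        pass  # stores need no effect: later loads come from the log

    return execute(program, load, store)


trace, log = record(PROGRAM, {0: 7, 1: 0, 2: 3}, random.Random(1))
assert replay(PROGRAM, log) == trace  # replay reproduces the recorded run
```

Because replay draws every load result from the log, it reproduces the recorded run regardless of what the simulated racing thread did. Recording the order of shared-memory accesses across threads, the role the abstract assigns to Strata, is outside the scope of this sketch.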
