Record and transplay: partial checkpointing for replay debugging across heterogeneous systems

Software bugs that occur in production are often difficult to reproduce in the lab due to subtle differences in the application environment and nondeterminism. To address this problem, we present Transplay, a system that captures production software bugs into small per-bug recordings which are used to reproduce the bugs on a completely different operating system without access to any of the original software used in the production environment. Transplay introduces partial checkpointing, a new mechanism that efficiently captures the partial state necessary to reexecute just the last few moments of the application before it encountered a failure. The recorded state, which typically consists of a few megabytes of data, is used to replay the application without requiring the specific application binaries, libraries, support data, or the original execution environment. Transplay integrates with existing debuggers to provide standard debugging facilities to allow the user to examine the contents of variables and other program state at each source line of the application's replayed execution. We have implemented a Transplay prototype that can record unmodified Linux applications and replay them on different versions of Linux as well as Windows. Experiments with several applications including Apache and MySQL show that Transplay can reproduce real bugs and be used in production with modest recording overhead.

[1]  Thomas J. LeBlanc,et al.  Debugging Parallel Programs with Instant Replay , 1987, IEEE Transactions on Computers.

[2]  Stuart I. Feldman,et al.  IGOR: a system for program debugging via reversible execution , 1988, PADD '88.

[3]  Mark A. Linton,et al.  Supporting reverse execution for parallel programs , 1988, PADD '88.

[4]  Robert O. Hastings,et al.  Fast detection of memory leaks and access errors , 1991 .

[5]  Michael B. Jones,et al.  Interposition agents: transparently interposing user code at the system interface , 1994, SOSP '93.

[6]  Robert H. B. Netzer,et al.  Optimal tracing and incremental reexecution for debugging long-running programs , 1994, PLDI '94.

[7]  Yang Meng Tan,et al.  LCLint: a tool for using specifications to check code , 1994, SIGSOFT '94.

[8]  Ian Goldberg,et al.  A Secure Environment for Untrusted Helper Applications ( Confining the Wily Hacker ) , 1996 .

[9]  James S. Plank,et al.  An Overview of Checkpointing in Uniprocessor and DistributedSystems, Focusing on Implementation and Performance , 1997 .

[10]  Thomas E. Anderson,et al.  SLIC: An Extensibility System for Commodity Operating Systems , 1998, USENIX ATC.

[11]  David Mosberger,et al.  httperf—a tool for measuring web server performance , 1998, PERV.

[12]  Koen De Bosschere,et al.  RecPlay: a fully integrated practical record/replay system , 1999, TOCS.

[13]  John Steven,et al.  jRapture: A Capture/Replay tool for observation-based testing , 2000, ISSTA '00.

[14]  Sotiris Ioannidis,et al.  Sub-operating systems: a new approach to application security , 2002, EW 10.

[15]  Jason Nieh,et al.  Proceedings of the 5th Symposium on Operating Systems Design and Implementation , 2022 .

[16]  Mikel Luján,et al.  Elimination of Java array bounds checks in the presence of indirection , 2002, JGI '02.

[17]  Koen De Bosschere,et al.  TORNADO: A Novel Input Replay Tool , 2003, PDPTA.

[18]  Min Xu,et al.  A "flight data recorder" for enabling full-system multiprocessor deterministic replay , 2003, ISCA '03.

[19]  Srikanth Kandula,et al.  Flashback: A Lightweight Extension for Rollback and Deterministic Replay for Software Debugging , 2004, USENIX Annual Technical Conference, General Track.

[20]  Tal Garfinkel,et al.  Ostia: A Delegating Architecture for Secure System Call Interposition , 2004, NDSS.

[21]  Jeffrey C. Mogul,et al.  Unveiling the transport , 2004, CCRV.

[22]  Satish Narayanasamy,et al.  BugNet: continuously recording program execution for deterministic replay debugging , 2005, 32nd International Symposium on Computer Architecture (ISCA'05).

[23]  Jose Renato Santos,et al.  Cruz: Application-Transparent Distributed Checkpoint-Restart on Standard Operating Systems , 2005, 2005 International Conference on Dependable Systems and Networks (DSN'05).

[24]  Song Jiang,et al.  Current practice and a direction forward in checkpoint/restart implementations for fault tolerance , 2005, 19th IEEE International Parallel and Distributed Processing Symposium.

[25]  Harish Patil,et al.  Pin: building customized program analysis tools with dynamic instrumentation , 2005, PLDI '05.

[26]  Samuel T. King,et al.  Debugging Operating Systems with Time-Traveling Virtual Machines (Awarded General Track Best Paper Award!) , 2005, USENIX Annual Technical Conference, General Track.

[27]  Fabrice Bellard,et al.  QEMU, a Fast and Portable Dynamic Translator , 2005, USENIX ATC, FREENIX Track.

[28]  Yasushi Saito,et al.  Jockey: a user-space library for record-replay debugging , 2005, AADEBUG'05.

[29]  Yuanyuan Zhou,et al.  BugBench: Benchmarks for Evaluating Bug Detection Tools , 2005 .

[30]  Satish Narayanasamy,et al.  Recording shared memory dependencies using strata , 2006, ASPLOS XII.

[31]  Serge E. Hallyn,et al.  Making Applications Mobile Under Linux , 2006 .

[32]  Jason Nieh,et al.  Transparent Checkpoint-Restart of Multiple Processes on Commodity Operating Systems , 2007, USENIX Annual Technical Conference.

[33]  Marc Vertes,et al.  Fault Tolerance in Multiprocessor Systems Via Application Cloning , 2007, 27th International Conference on Distributed Computing Systems (ICDCS '07).

[34]  Yuanyuan Zhou,et al.  Triage: diagnosing production run failures at the user's site , 2007, SOSP.

[35]  Xuezheng Liu,et al.  Usenix Association 8th Usenix Symposium on Operating Systems Design and Implementation R2: an Application-level Kernel for Record and Replay , 2022 .

[36]  Tal Garfinkel,et al.  VMwareDecoupling Dynamic Program Analysis from Execution in Virtual Environments , 2008, USENIX Annual Technical Conference.

[37]  Thomas Ball,et al.  Finding and Reproducing Heisenbugs in Concurrent Programs , 2008, OSDI.

[38]  Peter M. Chen,et al.  Execution replay of multiprocessor virtual machines , 2008, VEE '08.

[39]  Michael D. Ernst,et al.  Automatically patching errors in deployed software , 2009, SOSP '09.

[40]  Yuanyuan Zhou,et al.  PRES: probabilistic replay with execution sketching on multiprocessors , 2009, SOSP '09.

[41]  Ion Stoica,et al.  ODR: output-deterministic replay for multicore debugging , 2009, SOSP '09.

[42]  Josep Torrellas,et al.  Capo: a software-hardware interface for practical deterministic multiprocessor replay , 2009, ASPLOS.

[43]  Angelos D. Keromytis,et al.  ASSURE: automatic software self-healing using rescue points , 2009, ASPLOS.

[44]  Tal Garfinkel,et al.  Multi-stage replay with crosscut , 2010, VEE '10.

[45]  Jason Nieh,et al.  Transparent, lightweight application execution replay on commodity multiprocessor operating systems , 2010, SIGMETRICS '10.