Optimal tracing and incremental reexecution for debugging long-running programs

Debugging requires execution replay. The locations of bugs are rarely known in advance, so an execution must be repeated many times to track them down. Repeated reexecution is a problem for long-running programs and for programs that have complex interactions with their environment. Replaying a long-running program from the start incurs too much delay. Replaying a program that interacts with its environment requires the difficult (and sometimes impossible) task of exactly reproducing that environment (such as the connections over a one-day period to an X server). We solve these problems by incremental checkpointing and replay. By periodically checkpointing parts of the execution's state, an execution can be restarted from intermediate points, which bounds the delay to replay any part of the execution and allows parts of the execution to be skipped. We present adaptive tracing strategies that provide bounded-time incremental replay and that are nearly optimal. Our techniques track reads and writes to memory using space-efficient two-level bitvectors. Our implementation on a Sparc 10 traces less than 15 kilobytes/sec for CPU-intensive programs, and for interactive programs the slowdown is low enough that tracing can be left on all the time.
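The two-level bitvector mentioned above can be illustrated with a minimal sketch. It is not the paper's implementation: the class name `TwoLevelBitVector`, the `chunk_bits` parameter, and the helper `needs_trace` are hypothetical, and the sketch assumes one bit per memory word, with second-level chunks allocated only when a bit in them is first set. That lazy allocation is what keeps the structure small when a program touches only a sparse part of its address space.

```python
class TwoLevelBitVector:
    """Sparse bit set: a top-level table of lazily allocated bit chunks.

    Illustrative only; names and layout are assumptions, not the paper's.
    """

    def __init__(self, num_bits, chunk_bits=4096):
        if num_bits <= 0 or chunk_bits <= 0 or chunk_bits % 8 != 0:
            raise ValueError("num_bits must be positive; chunk_bits a positive multiple of 8")
        self.num_bits = num_bits
        self.chunk_bits = chunk_bits
        num_chunks = (num_bits + chunk_bits - 1) // chunk_bits
        # Top level: one slot per chunk, None until a bit in the chunk is set.
        self.chunks = [None] * num_chunks

    def _locate(self, index):
        """Map a bit index to (chunk number, bit offset inside the chunk)."""
        if not 0 <= index < self.num_bits:
            raise IndexError("bit index out of range")
        return divmod(index, self.chunk_bits)

    def set(self, index):
        chunk_no, offset = self._locate(index)
        chunk = self.chunks[chunk_no]
        if chunk is None:
            # Second level: allocate the chunk on first use.
            chunk = bytearray(self.chunk_bits // 8)
            self.chunks[chunk_no] = chunk
        chunk[offset >> 3] |= 1 << (offset & 7)

    def test(self, index):
        chunk_no, offset = self._locate(index)
        chunk = self.chunks[chunk_no]
        if chunk is None:
            return False  # an unallocated chunk holds only zero bits
        return bool(chunk[offset >> 3] & (1 << (offset & 7)))

    def allocated_chunks(self):
        return sum(1 for chunk in self.chunks if chunk is not None)

    def clear(self):
        """Drop every chunk, e.g. at the start of a new checkpoint interval."""
        self.chunks = [None] * len(self.chunks)


def needs_trace(written, word_index):
    """Hypothetical policy: a word read before being written in the current
    interval cannot be recomputed by replaying that interval alone, so its
    value would have to come from the trace."""
    return not written.test(word_index)


if __name__ == "__main__":
    written = TwoLevelBitVector(1 << 20)
    written.set(5)
    print(needs_trace(written, 5), needs_trace(written, 6))
    print(written.allocated_chunks())
```

In this sketch, one bitvector would record the words written since the last checkpoint, and calling `clear()` at each checkpoint starts a fresh interval. How the actual system assigns bits to memory and decides what to log is not specified in the abstract.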
