Execution replay and debugging

Most parallel and distributed programs are internally non-deterministic: consecutive runs with the same input may follow different execution paths. As a result, conventional cyclic debugging techniques cannot be applied to them directly. To make cyclic debugging possible, a tool must record enough information about an execution to allow it to be replayed. Because recording perturbs the execution, the amount of recorded information must be kept small and its processing fast. This paper surveys existing execution replay techniques and tools.
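To make the record/replay idea concrete, the following is a minimal sketch and not a technique taken from any particular tool in the survey. It assumes the only source of non-determinism is the order in which threads acquire a single lock. The names `ReplayLock` and `run` are hypothetical and introduced only for illustration. In record mode the lock logs the order of acquisitions; in replay mode it forces later runs to follow a previously recorded log.

```python
import threading


class ReplayLock:
    """Hypothetical lock: logs acquisition order, or enforces a given order."""

    def __init__(self, log=None):
        self._cond = threading.Condition()
        self._held = False
        self._replay = log is not None          # replay mode if a log is supplied
        self._expected = list(log) if log is not None else []
        self._pos = 0                           # index of the next expected acquisition
        self.log = []                           # acquisition order observed in this run

    def acquire(self, tid):
        with self._cond:
            # Wait while the lock is held or, in replay mode, while it is
            # not yet this thread's turn according to the recorded log.
            while self._held or (
                self._replay and self._expected[self._pos] != tid
            ):
                self._cond.wait()
            self._held = True
            self.log.append(tid)

    def release(self):
        with self._cond:
            self._held = False
            self._pos += 1
            self._cond.notify_all()


def run(num_threads=3, steps=4, log=None):
    """Run the workers; record when log is None, otherwise replay that log."""
    lock = ReplayLock(log)
    trace = []  # shared state whose content depends on the interleaving

    def worker(tid):
        for _ in range(steps):
            lock.acquire(tid)
            trace.append(tid)
            lock.release()

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return lock.log, trace


if __name__ == "__main__":
    recorded, first_trace = run()                 # record phase
    replayed, second_trace = run(log=recorded)    # replay phase
    print(replayed == recorded and second_trace == first_trace)
```

Only the thread identifier is logged per acquisition, which reflects the concern stated above: the recorded information is kept small so that recording perturbs the execution as little as possible.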
