Debugging Parallel Programs with Instant Replay

The debugging cycle is the most common methodology for finding and correcting errors in sequential programs. Cyclic debugging is effective because sequential programs are usually deterministic. Debugging parallel programs is considerably more difficult because successive executions of the same program often do not produce the same results. In this paper we present a general solution for reproducing the execution behavior of parallel programs, termed Instant Replay. During program execution we save the relative order of significant events as they occur, not the data associated with such events. As a result, our approach requires less time and space to save the information needed for program replay than other methods. Our technique is not dependent on any particular form of interprocess communication. It provides for replay of an entire program, rather than individual processes in isolation. No centralized bottlenecks are introduced and there is no need for synchronized clocks or a globally consistent logical time. We describe a prototype implementation of Instant Replay on the BBN Butterfly™ Parallel Processor, and discuss how it can be incorporated into the debugging cycle for parallel programs.
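The idea of saving the relative order of events rather than their data can be illustrated with a minimal sketch. The abstract does not spell out the mechanism, so the sketch makes its own assumptions: each shared object carries a version counter, every access is exclusive, and each thread keeps a private history of the versions it observed. The names `History`, `SharedObject`, and `run` are hypothetical and are not taken from the Butterfly prototype.

```python
import threading

class History:
    """Per-thread log of the object versions seen at each access (assumed design)."""
    def __init__(self, recorded=None):
        self.replaying = recorded is not None
        self.versions = list(recorded) if recorded is not None else []
        self.pos = 0  # next entry to consume during replay

class SharedObject:
    """Shared value whose accesses are ordered by a version counter."""
    def __init__(self, value):
        self.value = value
        self.version = 0
        self._cond = threading.Condition()

    def update(self, fn, hist):
        with self._cond:
            if hist.replaying:
                # Replay: wait until the object reaches the version this
                # thread saw at the same point in the recorded run.
                expected = hist.versions[hist.pos]
                hist.pos += 1
                self._cond.wait_for(lambda: self.version == expected)
            else:
                # Record: save only the version number, never the data.
                hist.versions.append(self.version)
            self.value = fn(self.value)
            self.version += 1
            self._cond.notify_all()

def run(ops_per_worker, histories):
    """Run one worker per history; each appends its name to a shared string."""
    obj = SharedObject("")

    def worker(name, hist):
        for _ in range(ops_per_worker):
            obj.update(lambda v: v + name, hist)

    threads = [threading.Thread(target=worker, args=(n, h))
               for n, h in histories.items()]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return obj.value
```

In this sketch, a recording run such as `run(3, {"A": History(), "B": History()})` may interleave differently each time, while a second run given `History(h.versions)` for each thread reproduces the recorded interleaving. Each history is private to its thread, so no central log or global clock is involved; the only shared state is the version counter on the object itself.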
