Paper information - Efficient Execution Replay Technique for Distributed Memory Architectures

Debugging parallel programs on MIMD machines is difficult because successive executions of the same program can behave differently. To solve this problem, a method called execution replay has been introduced, which guarantees that the reexecution of a program is equivalent to the initial execution. In this paper we present an execution replay technique for distributed memory architectures. In contrast to all other proposed approaches, our technique can handle non-blocking message passing primitives and can be adapted to any form of message passing communication. Since the technique is based on event numbering, we show how to bound these numbers and then analyse the influence of this bound on the amount of recorded information. The prototype implemented on an Intel iPSC/2 shows that the overhead due to recording control information is extremely low (about 1%).
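The record-and-replay idea in the abstract can be illustrated with a toy sketch. This is not the authors' algorithm: the `Process` class, the `run` helper, the per-process counter, and the log layout are all hypothetical, and the bounding of event numbers and non-blocking primitives are not modelled. It only shows that logging control information (who sent the message consumed at each numbered receive event) is enough to force a reexecution to take the same receive order, without logging message contents.

```python
import random


class Process:
    """Toy process; the receive order is the only source of nondeterminism."""

    def __init__(self, pid):
        self.pid = pid
        self.event_no = 0   # per-process event counter (assumed numbering scheme)
        self.mailbox = []   # pending messages: (sender_pid, send_event_no, payload)

    def send(self, dest, payload):
        self.event_no += 1
        dest.mailbox.append((self.pid, self.event_no, payload))

    def receive(self, rng=None, log=None, replay=None):
        self.event_no += 1
        if replay is not None:
            # Replay: consume the message recorded for this receive event.
            wanted = replay[self.event_no]
            idx = next(i for i, m in enumerate(self.mailbox) if m[:2] == wanted)
        else:
            # Initial run: arbitrary arrival order, modelled by a random pick.
            idx = rng.randrange(len(self.mailbox))
        sender, send_no, payload = self.mailbox.pop(idx)
        if log is not None:
            # Record control information only, never the payload.
            log[self.event_no] = (sender, send_no)
        return payload


def run(seed=None, replay=None):
    """Three senders each send two messages; one receiver consumes all six."""
    rng = random.Random(seed)
    log = {}
    receiver = Process(0)
    for sender in [Process(pid) for pid in (1, 2, 3)]:
        for k in range(2):
            sender.send(receiver, f"p{sender.pid}-m{k}")
    order = [receiver.receive(rng=rng, log=log, replay=replay) for _ in range(6)]
    return order, log
```

Calling `run(seed=1)` yields a receive order and a log; passing that log as `replay` to a second call with a different seed reproduces the same order, since the random choice is bypassed during replay.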

André Schiper | Eric Leu | Abdel Wahab Zramdini
