Efficient Execution Replay Technique for Distributed Memory Architectures

Debugging parallel programs on MIMD machines is difficult because successive executions of the same program can behave differently. Execution replay addresses this problem by guaranteeing that a re-execution of a program is equivalent to its initial execution. In this paper we present an execution replay technique for distributed memory architectures. In contrast to all other proposed approaches, our technique handles non-blocking message passing primitives and can be adapted to any form of message passing communication. Since the technique is based on event numbering, we show how to bound these numbers and then analyse the influence of this bound on the amount of recorded information. A prototype implemented on an Intel iPSC/2 shows that the overhead due to recording control information is extremely low (about 1%).
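To make the idea of replay by event numbering concrete, the following is a minimal sketch and not the algorithm of this paper. It assumes a toy in-memory mailbox model in which each process numbers its own send events, the recording run logs only the (sender, event number) pair of each received message, and the replay run uses that log to force the same delivery order. The names `Process`, `recv_record`, `recv_replay`, and `run` are hypothetical, and the bounding of event numbers is not modelled.

```python
from collections import deque


class Process:
    """Toy process: numbers its sends and logs the identity of each receive."""

    def __init__(self, pid):
        self.pid = pid
        self.send_count = 0   # per-process send event counter (unbounded here)
        self.mailbox = []     # pending messages: (sender, event number, payload)
        self.log = []         # control information: (sender, event number)

    def send(self, dest, payload):
        # Tag every message with the sender id and its send event number.
        self.send_count += 1
        dest.mailbox.append((self.pid, self.send_count, payload))

    def recv_record(self, pick):
        # Recording run: `pick` stands in for the nondeterministic arrival
        # order. Only the message identity is logged, not its contents.
        sender, seq, payload = self.mailbox.pop(pick)
        self.log.append((sender, seq))
        return payload

    def recv_replay(self, trace):
        # Replay run: deliver exactly the message named by the next log entry.
        wanted = trace.popleft()
        for i, (sender, seq, _) in enumerate(self.mailbox):
            if (sender, seq) == wanted:
                return self.mailbox.pop(i)[2]
        raise LookupError("logged message has not arrived yet")


def run(order=None, trace=None):
    """Two senders, one receiver; returns (received payloads, receiver log)."""
    p0, p1, p2 = Process(0), Process(1), Process(2)
    p0.send(p2, "a")
    p1.send(p2, "b")
    if trace is None:
        got = [p2.recv_record(i) for i in order]
    else:
        pending = deque(trace)
        got = [p2.recv_replay(pending) for _ in range(len(trace))]
    return got, p2.log


recorded, log = run(order=[1, 0])   # initial execution with an arbitrary order
replayed, _ = run(trace=log)        # re-execution driven by the log
```

In this sketch the log holds one small pair per receive rather than the message contents, which is the general reason recording control information can be cheap. A real implementation would also have to cope with messages that have not yet arrived at replay time, for example by blocking until they do.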