Execution replay on distributed memory architectures

Debugging parallel programs on MIMD machines is difficult because successive executions of the same program can behave differently. To solve this problem, a method called execution replay has been introduced, which guarantees that the reexecution of a program is equivalent to the initial execution. Most execution replay techniques proposed so far can be described as 'data driven techniques'. Such techniques are relatively easy to implement for the most common communication primitives. However, the time needed to record the large amount of required information is significant and may perturb the initial execution, in which case execution replay becomes meaningless. Another class, named control driven execution replay, limits the amount of recorded information. This paper presents a control driven solution that realizes execution replay on distributed memory architectures. In contrast to all other proposed approaches, the technique is adapted to nonblocking primitives and does not depend on any form of message passing communication.