ROLT^MP: replay of Lamport timestamps for message passing systems

Debugging nondeterministic parallel programs is difficult because consecutive runs with the same input data may result in different executions. To overcome this problem for cyclic debugging, replay mechanisms based on trace-driven simulation have been developed. Because replay is based on a previously monitored program run, the overhead generated by the monitoring functionality is critical: it has to be small enough to keep the intrusion on the program as low as possible. An example of a replay mechanism with low intrusion is the ROLT method, which was originally developed for shared memory systems. This method uses Lamport clocks to trace the order of accesses to shared objects. Although processes in message passing systems interact in a completely different way, some ideas of ROLT are useful and can be ported to the distributed memory area. As a result, an improved monitoring and replay approach with lower overhead than other existing methods can be implemented.
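
As a rough illustration of the Lamport clocks mentioned above, the following minimal sketch applies the standard update rules to message passing: a process increments its clock on every local event and on every send, and on a receive it advances its clock past the timestamp carried by the message. This is not the ROLT^MP implementation; the class name `LamportClock`, its methods, and the `demo` helper are hypothetical names chosen for this example only.

```python
class LamportClock:
    """Logical clock following Lamport's update rules (illustrative only)."""

    def __init__(self):
        self.time = 0

    def tick(self):
        # Local event: advance the clock by one.
        self.time += 1
        return self.time

    def send(self):
        # A send is an event too; the returned value is the timestamp
        # that would be attached to the outgoing message.
        return self.tick()

    def receive(self, msg_time):
        # On receive, move past both the local clock and the message timestamp.
        self.time = max(self.time, msg_time) + 1
        return self.time


def demo():
    """Two hypothetical processes, p and q, exchanging one message."""
    p, q = LamportClock(), LamportClock()
    trace = []
    trace.append(("p", "local", p.tick()))
    ts = p.send()
    trace.append(("p", "send", ts))
    trace.append(("q", "receive", q.receive(ts)))
    trace.append(("q", "local", q.tick()))
    return trace


if __name__ == "__main__":
    for event in demo():
        print(event)
```

In this sketch the receive on `q` always gets a larger timestamp than the matching send on `p`, so the recorded timestamps are consistent with the causal order of the events. A trace of such timestamps is the kind of ordering information a replay mechanism could use.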
