Error detection in large-scale parallel programs with long runtimes

Error detection is an important part of program development: it uncovers incorrect computations and runtime failures in software. The cost of debugging depends strongly on the complexity and scale of the program under investigation. Both characteristics are especially troublesome for large-scale parallel programs with long runtimes, which are common in computational science and engineering (CSE) applications. A solution is offered by a combination of techniques that use the event graph model to represent parallel program behaviour. With process isolation, only a subset of the original processes needs to be investigated, while the absent processes are simulated by the debugging system. With checkpointing, an arbitrary temporal section of a program's runtime can be extracted for exhaustive analysis without restarting the program from the beginning. Further benefits of the event graph are its support for equivalent execution of nondeterministic programs and a comprehensible visualisation as a space-time diagram.
