Ant: A Debugging Framework for MPI Parallel Programs

This paper describes Ant, a debugging framework for MPI parallel programs. Ant statically analyzes a program and marks each code region as executed either by all processes or by only some of them. The analyzed program is then instrumented with calls to a library that monitors for and detects invariant violations. The analysis allows each region to be instrumented according to whether all, or fewer than all, processes execute it. In regions executed by all processes, Ant’s instrumentation strategy allows monitoring to be sampled across processes. We present a case study using Ant with C-DIDUCE (a variant of DIDUCE for C) to find violations of value invariants in parallel C/MPI programs. Ant’s instrumentation strategy reduces monitoring overhead by a factor of more than 14, with less impact on accuracy than a scheme that simply distributes monitoring over all processes executing the program.
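To make the idea of sampled monitoring in all-process regions concrete, the following is a minimal sketch in C. It is not Ant's actual implementation: the names `region_kind` and `should_monitor` are hypothetical, and the round-robin policy is an assumption chosen for illustration. It assumes each process counts its visits to an instrumented site and that, in a region executed by all processes, those counts agree across processes.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical result of the static analysis for one code region. */
typedef enum {
    REGION_ALL_PROCS,   /* every process executes the region */
    REGION_SOME_PROCS   /* only a subset of the processes executes it */
} region_kind;

/*
 * Decide whether this process runs the invariant check for one dynamic
 * visit to an instrumented site.
 *
 * Assumed policy (illustrative only, not taken from the paper):
 *  - REGION_SOME_PROCS: always check, since no other process is
 *    guaranteed to reach the site.
 *  - REGION_ALL_PROCS: check only when the visit number maps to this
 *    rank, so each visit is checked by exactly one process.
 */
bool should_monitor(region_kind kind, int rank, int nprocs,
                    unsigned long visit)
{
    assert(nprocs > 0 && rank >= 0 && rank < nprocs);

    if (kind == REGION_SOME_PROCS)
        return true;

    return (visit % (unsigned long)nprocs) == (unsigned long)rank;
}
```

In an MPI program, `rank` and `nprocs` would come from `MPI_Comm_rank` and `MPI_Comm_size`, and the instrumented site would call the monitoring library only when `should_monitor` returns true. Under this assumed policy each process performs roughly 1/`nprocs` of the checks in an all-process region.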
