FlowChecker: Detecting Bugs in MPI Libraries via Message Flow Checking

Many MPI libraries have suffered from software bugs, which severely impact the productivity of a large number of users. This paper presents a new method called FlowChecker for detecting communication-related bugs in MPI libraries. The main idea is to extract program intentions of message passing (MP-intentions), and to check whether these MP-intentions are fulfilled correctly by the underlying MPI libraries, i.e., whether messages are delivered correctly from specified sources to specified destinations. If not, FlowChecker reports the bugs and provides diagnostic information. We have built a FlowChecker prototype on Linux and evaluated it with five real-world bug cases in three widely used MPI libraries: Open MPI, MPICH2, and MVAPICH2. Our experimental results show that FlowChecker effectively detects all five evaluated bug cases and provides useful diagnostic information. Additionally, our experiments with HPL and NPB show that FlowChecker incurs low runtime overhead (0.9-9.7% on three MPI libraries).
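The idea of checking MP-intentions against actual deliveries can be illustrated with a small sketch. This is not the FlowChecker implementation; it assumes, purely for illustration, that each intended and each observed message can be summarized as a (source, destination, tag, payload checksum) record. The names `Flow`, `flow`, and `check_flows` are hypothetical.

```python
import zlib
from collections import Counter
from typing import NamedTuple


class Flow(NamedTuple):
    """Hypothetical summary of one message: who sent it, to whom, and what."""
    src: int
    dst: int
    tag: int
    checksum: int


def flow(src: int, dst: int, tag: int, payload: bytes) -> Flow:
    # Summarize the payload with a CRC32 so records are cheap to compare.
    return Flow(src, dst, tag, zlib.crc32(payload))


def check_flows(intended, delivered):
    """Compare intended flows with observed deliveries.

    Returns (missing, unexpected): intentions with no matching delivery,
    and deliveries that match no intention.
    """
    want = Counter(intended)
    got = Counter(delivered)
    missing = list((want - got).elements())
    unexpected = list((got - want).elements())
    return missing, unexpected


# Rank 0 sends to rank 1 correctly; the reply from rank 1 arrives corrupted.
intended = [flow(0, 1, 7, b"halo"), flow(1, 0, 7, b"ack")]
delivered = [flow(0, 1, 7, b"halo"), flow(1, 0, 7, b"aXk")]

missing, unexpected = check_flows(intended, delivered)
for m in missing:
    print(f"unfulfilled intention: {m.src} -> {m.dst} (tag {m.tag})")
for u in unexpected:
    print(f"unexpected delivery:   {u.src} -> {u.dst} (tag {u.tag})")
```

In this toy model a corrupted payload shows up as one unfulfilled intention paired with one unexpected delivery on the same source, destination, and tag, which is the kind of diagnostic hint the abstract alludes to.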
