ParaStack: Efficient Hang Detection for MPI Programs at Large Scale

While program hangs on large parallel systems can be detected via the widely used timeout mechanism, it is difficult for users to choose an appropriate timeout: too small a timeout leads to a high false alarm rate, while too large a timeout wastes a vast amount of valuable computing resources. To address these problems, this paper presents ParaStack, an extremely lightweight tool that detects hangs in a timely manner with high accuracy, negligible overhead, and great scalability, without requiring the user to select a timeout value. For a detected hang, it guides further analysis by telling users whether the hang results from an error in the computation phase or the communication phase. For a computation-error induced hang, our tool pinpoints the faulty process by excluding hundreds or thousands of other processes. We have adapted ParaStack to work with the Torque and Slurm parallel batch schedulers and validated its functionality and performance on Tianhe-2 and Stampede, currently the world's 2nd and 12th fastest supercomputers, respectively. Experimental results demonstrate that ParaStack detects hangs in a timely manner at negligible overhead with over 99% accuracy. No false alarm is observed in correct runs lasting 66 hours at a scale of 256 processes and 39.7 hours at a scale of 1024 processes. ParaStack accurately reports the faulty process for computation-error induced hangs.
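The abstract does not spell out the detection algorithm, so the following is only a minimal sketch of timeout-free hang detection under stated assumptions: it assumes a monitor can periodically sample how many MPI ranks are executing application code (computation phase) rather than waiting inside MPI calls (communication phase), and it flags a hang when that count stays at zero over a fixed window of recent samples. This simple window rule is an illustrative stand-in, not ParaStack's actual statistical criterion; the function name detect_hang and all parameters below are hypothetical.

```python
# Hypothetical sketch of timeout-free hang detection (not ParaStack's
# actual algorithm). Each sample is the number of ranks observed busy
# in computation, i.e. outside MPI calls, at one sampling instant.
from collections import deque

def detect_hang(samples, window=30):
    """Return the sample index at which a hang is suspected, or None.

    samples: iterable of ints, each the count of ranks outside MPI calls
             at one sampling instant.
    window:  number of consecutive samples that must all show zero
             computing ranks before a hang is suspected.
    """
    recent = deque(maxlen=window)
    for i, busy_ranks in enumerate(samples):
        recent.append(busy_ranks)
        # Every recent sample shows no rank making computational
        # progress: all processes appear stuck in communication.
        if len(recent) == window and max(recent) == 0:
            return i
    return None

if __name__ == "__main__":
    # Healthy phase: the count fluctuates; then an injected fault leaves
    # every rank blocked in MPI, so the count drops to 0 and stays there.
    trace = [5, 7, 3, 6, 4] * 20 + [0] * 40
    print("hang suspected at sample", detect_hang(trace))
```

The point of this sketch is the design choice it illustrates: the decision is driven by a short window of recent job-wide behavior rather than a single user-chosen job timeout, so per-sample work stays negligible while detection latency is bounded by the window length rather than by a conservative wall-clock limit.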
