Using Performance Tools to Support Experiments in HPC Resilience

The high performance computing (HPC) community is working to address fault tolerance and resilience concerns for current and future large-scale computing platforms. This is driving enhancements in programming environments, in particular research on extending message passing libraries to support fault-tolerant computing. The community has also recognized that tools for resilience experimentation are largely lacking. We argue, however, that there are several parallels between “performance tools” and “resilience tools”, and that the rich set of HPC performance-focused tools can therefore be extended (repurposed) to benefit the resilience community.
