Toward a Performance/Resilience Tool for Hardware/Software Co-design of High-Performance Computing Systems

xSim is a simulation-based performance investigation toolkit that permits running high-performance computing (HPC) applications in a controlled environment with millions of concurrent execution threads, while observing application performance in a simulated extreme-scale system for hardware/software co-design. This work details newly developed xSim features that permit the injection of MPI process failures, the propagation, detection, and notification of such failures within the simulation, and their handling using application-level checkpoint/restart. These capabilities enable the observation of application behavior and performance under failure within a simulated future-generation HPC system using checkpoint/restart, the most common fault-handling technique.
