Paper information - Toward a Performance/Resilience Tool for Hardware/Software Co-design of High-Performance Computing Systems

xSim is a simulation-based performance investigation toolkit that permits running high-performance computing (HPC) applications in a controlled environment with millions of concurrent execution threads, while observing application performance in a simulated extreme-scale system for hardware/software co-design. The presented work details newly developed features for xSim that permit the injection of MPI process failures, the propagation/detection/notification of such failures within the simulation, and their handling using application-level checkpoint/restart. These new capabilities enable the observation of application behavior and performance under failure within a simulated future-generation HPC system using the most common fault handling technique.
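The fault-handling technique named in the abstract, application-level checkpoint/restart under injected process failures, can be illustrated with a small toy model. The sketch below is not xSim and does not use MPI; the names `InjectedFailure`, `run_attempt`, and `run_with_restart` are hypothetical and chosen only for illustration. It assumes a single process whose state is a step counter plus an accumulator, with failures injected at caller-chosen steps.

```python
import json
import os
import tempfile


class InjectedFailure(Exception):
    """Hypothetical stand-in for a detected process failure."""


def run_attempt(total_steps, checkpoint_every, fail_at, ckpt_path):
    """Run until completion or an injected failure, resuming from a checkpoint if present."""
    if os.path.exists(ckpt_path):
        with open(ckpt_path) as f:
            state = json.load(f)          # restart: reload the last saved state
    else:
        state = {"step": 0, "acc": 0}     # fresh start

    while state["step"] < total_steps:
        if state["step"] in fail_at:
            fail_at.discard(state["step"])  # each injected failure fires once
            raise InjectedFailure(state["step"])

        state["acc"] += state["step"]     # the "computation" for this step
        state["step"] += 1

        if state["step"] % checkpoint_every == 0:
            # Write to a temporary file, then rename, so a failure during
            # the write cannot leave a truncated checkpoint behind.
            tmp_path = ckpt_path + ".tmp"
            with open(tmp_path, "w") as f:
                json.dump(state, f)
            os.replace(tmp_path, ckpt_path)

    return state["acc"]


def run_with_restart(total_steps, checkpoint_every, fail_at):
    """Re-launch the application after every injected failure; return (result, restarts)."""
    pending = set(fail_at)
    restarts = 0
    with tempfile.TemporaryDirectory() as workdir:
        ckpt_path = os.path.join(workdir, "state.json")
        while True:
            try:
                result = run_attempt(total_steps, checkpoint_every, pending, ckpt_path)
                return result, restarts
            except InjectedFailure:
                restarts += 1             # work since the last checkpoint is lost


if __name__ == "__main__":
    print(run_with_restart(10, 3, [4, 8]))
```

In this toy model each failure discards only the work done since the last checkpoint, so the final result matches a failure-free run while the restart count records the cost of failure handling.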

Christian Engelmann | Thomas Naughton
