Toward a Performance/Resilience Tool for Hardware/Software Co-design of High-Performance Computing Systems
[1] Jason Duell,et al. Berkeley Lab Checkpoint/Restart (BLCR) for Linux Clusters, 2006.
[2] John T. Daly,et al. A higher order estimate of the optimum checkpoint interval for restart dumps , 2006, Future Gener. Comput. Syst.
[3] Jesús Labarta,et al. Validation of Dimemas Communication Model for MPI Collective Operations , 2000, PVM/MPI.
[4] Chao Wang,et al. A Job Pause Service under LAM/MPI+BLCR for Transparent Fault Tolerance , 2007, 2007 IEEE International Parallel and Distributed Processing Symposium.
[5] Stephen L. Scott,et al. Reliability-aware resource allocation in HPC systems , 2007, 2007 IEEE International Conference on Cluster Computing.
[6] Christian Engelmann,et al. Facilitating co-design for extreme-scale systems through lightweight simulation , 2010, 2010 IEEE International Conference On Cluster Computing Workshops and Posters (CLUSTER WORKSHOPS).
[7] Jack Dongarra,et al. Extending the MPI Specification for Process Fault Tolerance on High Performance Computing Systems, 2004.
[8] Matthias S. Müller,et al. The Vampir Performance Analysis Tool-Set , 2008, Parallel Tools Workshop.
[9] Chi-Keung Luk,et al. PinOS: a programmable framework for whole-system dynamic instrumentation , 2007, VEE '07.
[10] Chao Wang,et al. Hybrid Checkpointing for MPI Jobs in HPC Environments , 2010, 2010 IEEE 16th International Conference on Parallel and Distributed Systems.
[11] German Rodriguez,et al. Trace-driven co-simulation of high-performance computing systems using OMNeT++ , 2009, SIMUTools 2009.
[12] Sarah Ellen Michalak,et al. Application MTTFE vs. Platform MTBF: A Fresh Perspective on System Reliability and Application Throughput for Computations at Scale , 2008, 2008 Eighth IEEE International Symposium on Cluster Computing and the Grid (CCGRID).
[13] Christian Engelmann,et al. Proactive process-level live migration in HPC environments , 2008, HiPC 2008.
[14] Cyriel Minkenberg,et al. Trace-driven co-simulation of high-performance computing systems using OMNeT++ , 2009, SimuTools.
[15] Jeffrey K. Hollingsworth,et al. An API for Runtime Code Patching , 2000, Int. J. High Perform. Comput. Appl.
[16] Fei Meng,et al. Functional Partitioning to Optimize End-to-End Performance on Many-core Architectures , 2010, 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis.
[17] Christian Engelmann,et al. xSim: The extreme-scale simulator , 2011, 2011 International Conference on High Performance Computing & Simulation.
[18] Rolf Riesen,et al. Detection and Correction of Silent Data Corruption for Large-Scale High-Performance Computing , 2012, 2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and Phd Forum.
[19] Franck Cappello,et al. Toward Exascale Resilience , 2009, Int. J. High Perform. Comput. Appl.
[20] Christian Engelmann,et al. Redundant Execution of HPC Applications with MR-MPI, 2011.
[21] Christian Engelmann,et al. Scaling to a million cores and beyond: Using light-weight simulation to understand the challenges ahead on the road to exascale , 2014, Future Gener. Comput. Syst.
[22] Chao Wang,et al. Proactive process-level live migration in HPC environments , 2008, 2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis.
[23] Henrique Madeira,et al. Xception: A Technique for the Experimental Evaluation of Dependability in Modern Computers , 1998, IEEE Trans. Software Eng.
[24] Henrique Madeira,et al. Experimental assessment of parallel systems , 1996, Proceedings of Annual Symposium on Fault Tolerant Computing.
[25] Christian Engelmann,et al. Combining Partial Redundancy and Checkpointing for HPC , 2012, 2012 IEEE 32nd International Conference on Distributed Computing Systems.
[26] Henri Casanova,et al. Single Node On-Line Simulation of MPI Applications with SMPI , 2011, 2011 IEEE International Parallel & Distributed Processing Symposium.
[27] Chao Wang,et al. Proactive process-level live migration and back migration in HPC environments , 2012, J. Parallel Distributed Comput.
[28] Nathan DeBardeleben,et al. Towards a hardware fault-injection testbed to support reproducible resiliency experiments , 2009, Resilience '09.
[29] Christian Engelmann,et al. Proactive Fault Tolerance Using Preemptive Migration , 2009, 2009 17th Euromicro International Conference on Parallel, Distributed and Network-based Processing.
[30] Barton P. Miller,et al. Fine-grained dynamic instrumentation of commodity operating system kernels , 1999, OSDI '99.
[31] R. G. Minnich. A dynamic kernel modifier for linux, 2002.
[32] Toni Cortes,et al. PARAVER: A Tool to Visualize and Analyze Parallel Code, 2007.
[33] Stephen L. Scott,et al. Benefits of Software Rejuvenation on HPC Systems , 2010, International Symposium on Parallel and Distributed Processing with Applications.
[34] Charng-Da Lu,et al. Assessing Fault Sensitivity in MPI Applications , 2004, Proceedings of the ACM/IEEE SC2004 Conference.
[35] Al Geist,et al. Major Computer Science Challenges At Exascale , 2009, Int. J. High Perform. Comput. Appl.
[36] Thomas Hérault,et al. Improved message logging versus improved coordinated checkpointing for fault tolerant MPI , 2004, 2004 IEEE International Conference on Cluster Computing (IEEE Cat. No.04EX935).
[37] Bianca Schroeder,et al. Understanding failures in petascale computers, 2007.
[38] Harish Patil,et al. Pin: building customized program analysis tools with dynamic instrumentation , 2005, PLDI '05.
[39] Bryan Cantrill,et al. Dynamic Instrumentation of Production Systems , 2004, USENIX Annual Technical Conference, General Track.
[40] Bruce Jacob,et al. The structural simulation toolkit , 2006, PERV.
[41] Christian Engelmann,et al. Fault injection framework for system resilience evaluation: fake faults for finding future failures , 2009, Resilience '09.