TeaMPI—Replication-Based Resilience Without the (Performance) Pain
[1] Michael Bader, et al. Influence of A-Posteriori Subcell Limiting on Fault Frequency in Higher-Order DG Schemes, 2018, 2018 IEEE/ACM 8th Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS).
[2] Jack Dongarra, et al. Applied Mathematics Research for Exascale Computing, 2014.
[3] Michael Dumbser, et al. ExaHyPE: An Engine for Parallel Dynamically Adaptive Simulations of Wave Problems, 2019, Comput. Phys. Commun.
[4] Jinsuk Chung, et al. Containment domains: A scalable, efficient, and flexible resilience scheme for exascale systems, 2012, 2012 International Conference for High Performance Computing, Networking, Storage and Analysis.
[5] Bianca Schroeder, et al. A Large-Scale Study of Failures in High-Performance Computing Systems, 2006, IEEE Transactions on Dependable and Secure Computing.
[6] Omer Subasi, et al. Designing and Modelling Selective Replication for Fault-Tolerant HPC Applications, 2017, 2017 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID).
[7] Christian Engelmann, et al. Scaling to a million cores and beyond: Using light-weight simulation to understand the challenges ahead on the road to exascale, 2014, Future Gener. Comput. Syst.
[8] Rolf Riesen, et al. Detection and Correction of Silent Data Corruption for Large-Scale High-Performance Computing, 2012, 2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and Phd Forum.
[9] Frank Mueller, et al. End-to-End Resilience for HPC Applications, 2019, ISC.
[10] Philipp Samfass, et al. Tasks Unlimited: Lightweight Task Offloading Exploiting MPI Wait Times for Parallel Adaptive Mesh Refinement, 2019, ArXiv.
[11] Jannis Klinkenberg, et al. CHAMELEON: Reactive Load Balancing for Hybrid MPI+OpenMP Task-Parallel Applications, 2020, J. Parallel Distributed Comput.
[12] Michael Dumbser, et al. Studies on the energy and deep memory behaviour of a cache-oblivious, task-based hyperbolic PDE solver, 2018, Int. J. High Perform. Comput. Appl.
[13] Martin Schulz, et al. Exploiting Data Similarity to Reduce Memory Footprints, 2011, 2011 IEEE International Parallel & Distributed Processing Symposium.
[14] Thomas Hérault, et al. Post-failure recovery of MPI communication capability, 2013, Int. J. High Perform. Comput. Appl.
[15] Dominik Göddeke, et al. Soft fault detection and correction for multigrid, 2018, Int. J. High Perform. Comput. Appl.
[16] Franck Cappello, et al. Fault Tolerance in Petascale/Exascale Systems: Current Knowledge, Challenges and Research Opportunities, 2009, Int. J. High Perform. Comput. Appl.
[17] Thomas Hérault, et al. Design for a Soft Error Resilient Dynamic Task-Based Runtime, 2015, 2015 IEEE International Parallel and Distributed Processing Symposium.
[18] Jannis Klinkenberg, et al. Hybrid MPI+OpenMP Reactive Work Stealing in Distributed Memory in the PDE Framework sam(oa)^2, 2018, 2018 IEEE International Conference on Cluster Computing (CLUSTER).
[19] Zizhong Chen, et al. Process Fault Tolerance: Semantics, Design and Applications for High Performance Computing, 2005, Int. J. High Perform. Comput. Appl.
[20] Philipp Samfass, et al. Lightweight task offloading exploiting MPI wait times for parallel adaptive mesh refinement, 2020, Concurr. Comput. Pract. Exp.
[21] Torsten Hoefler, et al. Message progression in parallel computing - to thread or not to thread?, 2008, 2008 IEEE International Conference on Cluster Computing.
[22] Dirk Ribbrock, et al. Fault-tolerant finite-element multigrid algorithms with hierarchically compressed asynchronous checkpointing, 2015, Parallel Comput.
[23] Dirk Pflüger, et al. A Massively-Parallel, Fault-Tolerant Solver for High-Dimensional PDEs, 2016, Euro-Par Workshops.
[24] Michael Dumbser, et al. A simple diffuse interface approach on adaptive Cartesian grids for the linear elastic wave equations with complex topography, 2018, J. Comput. Phys.
[25] Tobias Weinzierl, et al. Enclave Tasking for Discontinuous Galerkin Methods on Dynamically Adaptive Meshes, 2018, SIAM J. Sci. Comput.
[26] James H. Laros, et al. Evaluating the viability of process replication reliability for exascale systems, 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).
[27] George Bosilca, et al. Fault tolerant high performance computing by a coding approach, 2005, PPoPP.
[28] Tyler A. Simon, et al. Improving Application Resilience through Probabilistic Task Replication, 2013.
[29] Christian Engelmann, et al. The Case for Modular Redundancy in Large-Scale High Performance Computing Systems, 2009.
[30] Rolf Riesen, et al. See applications run and throughput jump: The case for redundant computing in HPC, 2010, 2010 International Conference on Dependable Systems and Networks Workshops (DSN-W).