Integrating Inter-Node Communication with a Resilient Asynchronous Many-Task Runtime System
Akihiro Hayashi | Keita Teranishi | Jackson Mayo | Sri Raj Paul | Vivek Sarkar | Matthew Whitlock | Seonmyeong Bak | Max Grossman
[1] Franck Cappello, et al. Toward Exascale Resilience: 2014 update, 2014, Supercomput. Front. Innov.
[2] Vivek Sarkar, et al. A Pluggable Framework for Composable HPC Scheduling Libraries, 2017, 2017 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW).
[3] Thomas L. Sterling, et al. ParalleX An Advanced Parallel Execution Model for Scaling-Impaired Applications, 2009, 2009 International Conference on Parallel Processing Workshops.
[4] Vivek Sarkar, et al. Enabling Resilience in Asynchronous Many-Task Programming Models, 2019, Euro-Par.
[5] Martin Berzins, et al. ASC ATDM Level 2 Milestone #5325: Asynchronous Many-Task Runtime System Analysis and Assessment for Next Generation Platforms, 2015.
[6] Alexander Aiken, et al. Legion: Expressing locality and independence with logical regions, 2012, 2012 International Conference for High Performance Computing, Networking, Storage and Analysis.
[7] Dhabaleswar K. Panda, et al. EReinit: Scalable and efficient fault‐tolerance for bulk‐synchronous MPI applications, 2018, Concurr. Comput. Pract. Exp.
[8] Jinsuk Chung, et al. Containment domains: A scalable, efficient, and flexible resilience scheme for exascale systems, 2012, 2012 International Conference for High Performance Computing, Networking, Storage and Analysis.
[9] Cédric Augonnet, et al. StarPU: a unified platform for task scheduling on heterogeneous multicore architectures, 2011, Concurr. Comput. Pract. Exp.
[10] Michael A. Heroux, et al. Fenix, A Fault Tolerant Programming Framework for MPI Applications, 2016.
[11] Eduard Ayguadé, et al. Task-Based Programming with OmpSs and Its Application, 2014, Euro-Par Workshops.
[12] Omer Subasi, et al. Designing and Modelling Selective Replication for Fault-Tolerant HPC Applications, 2017, 2017 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID).
[13] Thomas Hérault, et al. Design for a Soft Error Resilient Dynamic Task-Based Runtime, 2015, 2015 IEEE International Parallel and Distributed Processing Symposium.
[14] Scott Klasky, et al. Exploring Automatic, Online Failure Recovery for Scientific Applications at Extreme Scales, 2014, SC14: International Conference for High Performance Computing, Networking, Storage and Analysis.
[15] Ravishankar K. Iyer, et al. Measuring and Understanding Extreme-Scale Application Resilience: A Field Study of 5,000,000 HPC Application Runs, 2015, 2015 45th Annual IEEE/IFIP International Conference on Dependable Systems and Networks.
[16] Jason Duell, et al. Berkeley Lab Checkpoint/Restart (BLCR) for Linux Clusters, 2006.
[17] Nathan R. Tallent, et al. HPCTOOLKIT: tools for performance analysis of optimized parallel programs http://hpctoolkit.org, 2010.
[18] Martin Schulz, et al. Evaluating and extending user-level fault tolerance in MPI applications, 2016, Int. J. High Perform. Comput. Appl.
[19] Daniel S. Katz, et al. Evaluating Online Global Recovery with Fenix Using Application-Aware In-Memory Checkpointing Techniques, 2016, 2016 45th International Conference on Parallel Processing Workshops (ICPPW).
[20] Benoît Meister, et al. The Open Community Runtime: A runtime system for extreme scale computing, 2016, 2016 IEEE High Performance Extreme Computing Conference (HPEC).
[21] Ignacio Laguna, et al. Reinit++: Evaluating the Performance of Global-Restart Recovery Methods for MPI Fault Tolerance, 2020, ISC.
[22] Alejandro Duran, et al. OmpSs: a Proposal for Programming Heterogeneous Multi-Core Architectures, 2011, Parallel Process. Lett.
[23] Bronis R. de Supinski, et al. Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System, 2010, 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis.
[24] Thomas Hérault, et al. PaRSEC: Exploiting Heterogeneity to Enhance Scalability, 2013, Computing in Science & Engineering.
[25] Osman S. Unsal, et al. NanoCheckpoints: A Task-Based Asynchronous Dataflow Framework for Efficient and Scalable Checkpoint/Restart, 2015, 2015 23rd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing.
[26] B. Bouteiller, et al. MPICH-V2: a Fault Tolerant MPI for Volatile Nodes based on Pessimistic Sender Based Message Logging, 2003, ACM/IEEE SC 2003 Conference (SC'03).
[27] Thomas Hérault, et al. Post-failure recovery of MPI communication capability, 2013, Int. J. High Perform. Comput. Appl.