Towards Distributed Software Resilience in Asynchronous Many-Task Programming Models
[1] Hartmut Kaiser,et al. HPX: A Task Based Programming Model in a Global Address Space , 2014, PGAS.
[2] Osman S. Unsal,et al. NanoCheckpoints: A Task-Based Asynchronous Dataflow Framework for Efficient and Scalable Checkpoint/Restart , 2015, 2015 23rd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing.
[3] J. Duell. The design and implementation of Berkeley Lab's Linux checkpoint/restart , 2005.
[4] Omer Subasi,et al. A Runtime Heuristic to Selectively Replicate Tasks for Application-Specific Reliability Targets , 2016, 2016 IEEE International Conference on Cluster Computing (CLUSTER).
[5] Vivek Sarkar,et al. Enabling Resilience in Asynchronous Many-Task Programming Models , 2019, Euro-Par.
[6] Daniel Sunderland,et al. Kokkos: Enabling manycore performance portability through polymorphic memory access patterns , 2014, J. Parallel Distributed Comput.
[7] Scott Klasky,et al. Exploring Automatic, Online Failure Recovery for Scientific Applications at Extreme Scales , 2014, SC14: International Conference for High Performance Computing, Networking, Storage and Analysis.
[8] John T. Daly,et al. A higher order estimate of the optimum checkpoint interval for restart dumps , 2006, Future Gener. Comput. Syst.
[9] Dirk Pflüger,et al. Harnessing billions of tasks for a scalable portable hydrodynamic simulation of the merger of two stars , 2019, Int. J. High Perform. Comput. Appl.
[10] Thomas Hérault,et al. Design for a Soft Error Resilient Dynamic Task-Based Runtime , 2015, 2015 IEEE International Parallel and Distributed Processing Symposium.
[11] Bronis R. de Supinski,et al. Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System , 2010, 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis.
[12] Laxmikant V. Kalé,et al. Proactive Fault Tolerance in MPI Applications Via Task Migration , 2006, HiPC.
[13] Eric Roman. A Survey of Checkpoint/Restart Implementations , 2002.
[14] Michael A. Heroux,et al. Toward Local Failure Local Recovery Resilience Model using MPI-ULFM , 2014, EuroMPI/ASIA.
[15] Dietmar Fey,et al. Using HPX and LibGeoDecomp for scaling HPC applications on heterogeneous supercomputers , 2013, ScalA '13.
[16] Jackson R. Mayo,et al. Implementing Software Resiliency in HPX for Extreme Scale Computing , 2020, ArXiv.
[17] Thomas L. Sterling,et al. ParalleX: An Advanced Parallel Execution Model for Scaling-Impaired Applications , 2009, 2009 International Conference on Parallel Processing Workshops.
[18] Jason Duell,et al. The design and implementation of Berkeley Lab's Linux checkpoint/restart , 2005.
[19] Dietmar Fey,et al. Higher-level parallelization for local and distributed asynchronous task-based programming , 2015, ESPM '15.
[20] Jeffrey F. Naughton,et al. Low-Latency, Concurrent Checkpointing for Parallel Programs , 1994, IEEE Trans. Parallel Distributed Syst.
[21] Jeanine Cook,et al. The Performance Implication of Task Size for Applications on the HPX Runtime System , 2015, 2015 IEEE International Conference on Cluster Computing.
[22] Kai Li,et al. Diskless Checkpointing , 1998, IEEE Trans. Parallel Distributed Syst.
[23] Patrick Diehl,et al. Closing the Performance Gap with Modern C++ , 2016, HiPC 2016.
[24] Franck Cappello,et al. Uncoordinated Checkpointing Without Domino Effect for Send-Deterministic MPI Applications , 2011, 2011 IEEE International Parallel & Distributed Processing Symposium.
[25] Ignacio Laguna,et al. Reinit++: Evaluating the Performance of Global-Restart Recovery Methods for MPI Fault Tolerance , 2020, ISC.
[26] Thomas Heller,et al. Application of the ParalleX execution model to stencil-based problems , 2012, Computer Science - Research and Development.
[27] Parsa Amini,et al. Assessing the Performance Impact of using an Active Global Address Space in HPX: A Case for AGAS , 2019, 2019 IEEE/ACM Third Annual Workshop on Emerging Parallel and Distributed Runtime Systems and Middleware (IPDRM).
[28] Dhabaleswar K. Panda,et al. EReinit: Scalable and efficient fault‐tolerance for bulk‐synchronous MPI applications , 2018, Concurr. Comput. Pract. Exp.
[29] Patrick Diehl,et al. An asynchronous and task-based implementation of peridynamics utilizing HPX—the C++ standard library for parallelism and concurrency , 2018, SN Applied Sciences.
[30] George Bosilca,et al. Fault tolerance of MPI applications in exascale systems: The ULFM solution , 2020, Future Gener. Comput. Syst.
[31] Omer Subasi,et al. Designing and Modelling Selective Replication for Fault-Tolerant HPC Applications , 2017, 2017 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID).