FT-Offload: A Scalable Fault-Tolerance Programing Model on MIC Cluster
[1] Rida A. Bazzi, et al. Heterogeneous checkpointing for multithreaded applications, 2002, 21st IEEE Symposium on Reliable Distributed Systems.
[2] Zizhong Chen. Algorithm-based recovery for iterative methods without checkpointing, 2011, HPDC '11.
[3] Shant Shahbazian, et al. Revisiting the foundations of quantum theory of atoms in molecules (QTAIM): The variational procedure and the zero‐flux conditions, 2008.
[4] Dhabaleswar K. Panda, et al. MIC-Check: a distributed check pointing framework for the Intel many integrated cores architecture, 2014, HPDC '14.
[5] Naga K. Govindaraju, et al. GPGPU: general-purpose computation on graphics hardware, 2006, SC.
[6] Canqun Yang, et al. MilkyWay-2 supercomputer: system and application, 2014, Frontiers of Computer Science.
[7] Kai Lu, et al. Adaptive Optimization for Petascale Heterogeneous CPU/GPU Computing, 2010, 2010 IEEE International Conference on Cluster Computing.
[8] Wang Feng, et al. Programming for scientific computing on peta-scale heterogeneous parallel systems, 2013.
[9] Hiroaki Kobayashi, et al. CheCUDA: A Checkpoint/Restart Tool for CUDA Applications, 2009, 2009 International Conference on Parallel and Distributed Computing, Applications and Technologies.
[10] Kevin Skadron, et al. A hardware redundancy and recovery mechanism for reliable scientific computation on graphics processors, 2007, GH '07.
[11] Xing Cai, et al. Communication‐hiding programming for clusters with multi‐coprocessor nodes, 2015, Concurr. Comput. Pract. Exp.
[12] Laxmikant V. Kale, et al. Charm++ and AMPI: Adaptive Runtime Strategies via Migratable Objects, 2009.
[13] David Kirk, et al. NVIDIA CUDA software and GPU parallel computing architecture, 2007, ISMM '07.
[14] Jack J. Dongarra, et al. FT-MPI: Fault Tolerant MPI, Supporting Dynamic Applications in a Dynamic World, 2000, PVM/MPI.
[15] Tao Tang, et al. OpenMC: Towards Simplifying Programming for TianHe Supercomputers, 2014, Journal of Computer Science and Technology.
[16] Jingling Xue, et al. PartialRC: A Partial Recomputing Method for Efficient Fault Recovery on GPGPUs, 2012, Journal of Computer Science and Technology.
[17] Hiroaki Kobayashi, et al. CheCL: Transparent Checkpointing and Process Migration of OpenCL Applications, 2011, 2011 IEEE International Parallel & Distributed Processing Symposium.
[18] Richard Koo, et al. Checkpointing and Rollback-Recovery for Distributed Systems, 1986, IEEE Transactions on Software Engineering.
[19] Laxmikant V. Kalé, et al. A scalable double in-memory checkpoint and restart scheme towards exascale, 2012, IEEE/IFIP International Conference on Dependable Systems and Networks Workshops (DSN 2012).
[20] Georg Stellner, et al. CoCheck: checkpointing and process migration for MPI, 1996, Proceedings of International Conference on Parallel Processing.
[21] Kai Lu, et al. The TianHe-1A Supercomputer: Its Hardware and Software, 2011, Journal of Computer Science and Technology.
[22] Roy Friedman, et al. Starfish: Fault-Tolerant Dynamic MPI Programs on Clusters of Workstations, 1999, The Eighth International Symposium on High Performance Distributed Computing.
[23] Daniel Marques, et al. Automated application-level checkpointing of MPI programs, 2003, PPoPP '03.
[24] Jason Duell, et al. Berkeley Lab Checkpoint/Restart (BLCR) for Linux Clusters, 2006.
[25] Richard Barrett, et al. Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods, 1994, Other Titles in Applied Mathematics.