When is multi-version checkpointing needed?
暂无分享,去创建一个
[1] Sally A. McKee,et al. ROSE::FTTransform - A source-to-source translation framework for exascale fault-tolerance research , 2012, IEEE/IFIP International Conference on Dependable Systems and Networks Workshops (DSN 2012).
[2] James H. Laros,et al. Evaluating the viability of process replication reliability for exascale systems , 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).
[3] Babak Falsafi,et al. ParaLog: enabling and accelerating online parallel monitoring of multithreaded applications , 2010, ASPLOS XV.
[4] John T. Daly,et al. A higher order estimate of the optimum checkpoint interval for restart dumps , 2006, Future Gener. Comput. Syst..
[5] Narayan Desai,et al. Co-analysis of RAS Log and Job Log on Blue Gene/P , 2011, 2011 IEEE International Parallel & Distributed Processing Symposium.
[6] Jacob A. Abraham,et al. Algorithm-Based Fault Tolerance for Matrix Operations , 1984, IEEE Transactions on Computers.
[7] Padma Raghavan,et al. Characterizing the impact of soft errors on iterative methods in scientific computing , 2011, ICS '11.
[8] Vilas Sridharan,et al. A study of DRAM failures in the field , 2012, 2012 International Conference for High Performance Computing, Networking, Storage and Analysis.
[9] Andrew A. Chien,et al. The future of microprocessors , 2011, Commun. ACM.
[10] Robert B. Ross,et al. ISOBAR hybrid compression-I/O interleaving for large-scale parallel I/O optimization , 2012, HPDC '12.
[11] Andrew A. Chien,et al. An evaluation of difference and threshold techniques for efficient checkpoints , 2012, IEEE/IFIP International Conference on Dependable Systems and Networks Workshops (DSN 2012).
[12] Franck Cappello,et al. FTI: High performance Fault Tolerance Interface for hybrid systems , 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).
[13] Bronis R. de Supinski,et al. McrEngine: a scalable checkpointing system using data-aware aggregation and compression , 2012, HiPC 2012.
[14] Kurt B. Ferreira,et al. Fault-tolerant iterative methods via selective reliability. , 2011 .
[15] Todd C. Mowry,et al. Log-based architectures: using multicore to help software behave correctly , 2011, OPSR.
[16] John W. Young,et al. A first order approximation to the optimum checkpoint interval , 1974, CACM.
[17] Bronis R. de Supinski,et al. Soft error vulnerability of iterative linear algebra methods , 2007, ICS '08.
[18] E. N. Elnozahy,et al. Checkpointing for peta-scale systems: a look into the future of practical rollback-recovery , 2004, IEEE Transactions on Dependable and Secure Computing.
[19] Stephen L. Scott,et al. An optimal checkpoint/restart model for a large scale high performance computing system , 2008, 2008 IEEE International Symposium on Parallel and Distributed Processing.
[20] Robert Latham,et al. ISOBAR Preconditioner for Effective and High-throughput Lossless Data Compression , 2012, 2012 IEEE 28th International Conference on Data Engineering.
[21] Sarita V. Adve,et al. Low-cost program-level detectors for reducing silent data corruptions , 2012, IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2012).
[22] W YoungJohn. A first order approximation to the optimum checkpoint interval , 1974 .
[23] Thomas Hérault,et al. MPICH-V Project: A Multiprotocol Automatic Fault-Tolerant MPI , 2006, Int. J. High Perform. Comput. Appl..
[24] David Fiala. Detection and correction of silent data corruption for large-scale high-performance computing , 2012, 2012 International Conference for High Performance Computing, Networking, Storage and Analysis.
[25] Bianca Schroeder,et al. Cosmic rays don't strike twice: understanding the nature of DRAM errors and the implications for system design , 2012, ASPLOS XVII.
[26] Charng-Da Lu,et al. Assessing Fault Sensitivity in MPI Applications , 2004, Proceedings of the ACM/IEEE SC2004 Conference.
[27] Milos Prvulovic,et al. Euripus: A flexible unified hardware memory checkpointing accelerator for bidirectional-debugging and reliability , 2012, 2012 39th Annual International Symposium on Computer Architecture (ISCA).
[28] Ron Brightwell,et al. Cooperative Application/OS DRAM Fault Recovery , 2011, Euro-Par Workshops.
[29] Kai Li,et al. Libckpt: Transparent Checkpointing under UNIX , 1995, USENIX.
[30] Jason Duell,et al. Requirements for Linux Checkpoint/Restart , 2002 .
[31] Zizhong Chen. Algorithm-based recovery for iterative methods without checkpointing , 2011, HPDC '11.
[32] John T. Daly,et al. Impact of sub-optimal checkpoint intervals on application efficiency in computational clusters , 2010, HPDC '10.
[33] Bianca Schroeder,et al. A Large-Scale Study of Failures in High-Performance Computing Systems , 2010, IEEE Trans. Dependable Secur. Comput..
[34] Bronis R. de Supinski,et al. Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System , 2010, 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis.