Improving performance of iterative methods by lossy checkpointing
Franck Cappello | Dingwen Tao | Sheng Di | Zizhong Chen | Xin Liang
[1] John W. Young,et al. A first order approximation to the optimum checkpoint interval , 1974, CACM.
[2] Dingwen Tao,et al. Towards Practical Algorithm Based Fault Tolerance in Dense Linear Algebra , 2016, HPDC.
[3] Katja Bachmeier,et al. Numerical Heat Transfer And Fluid Flow , 2016 .
[4] Shuaiwen Song,et al. New-Sum: A Novel Online ABFT Scheme For General Iterative Methods , 2016, HPDC.
[5] Kurt B. Ferreira,et al. Fault-tolerant linear solvers via selective reliability , 2012, ArXiv.
[6] Franck Cappello,et al. FTI: High performance Fault Tolerance Interface for hybrid systems , 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).
[7] Frank Mueller,et al. Evaluating the Impact of SDC on the GMRES Iterative Solver , 2013, 2014 IEEE 28th International Parallel and Distributed Processing Symposium.
[8] Bohn Stafleu van Loghum,et al. Online … , 2002, LOG IN.
[9] Wei Ge,et al. The Sunway TaihuLight supercomputer: system and applications , 2016, Science China Information Sciences.
[10] Dingwen Tao,et al. Correcting soft errors online in fast fourier transform , 2017, SC.
[11] Kai Li,et al. Diskless Checkpointing , 1998, IEEE Trans. Parallel Distributed Syst..
[12] John Shalf,et al. On the Role of Co-design in High Performance Computing , 2012, High Performance Computing Workshop.
[13] Dingwen Tao,et al. Silent Data Corruption Resilient Two-sided Matrix Factorizations , 2017, PPoPP.
[14] Zuoning Chen,et al. A Large-Scale Study of Failures on Petascale Supercomputers , 2018, Journal of Computer Science and Technology.
[15] Raphaël Couturier,et al. Parallel Iterative Algorithms: From Sequential to Grid Computing (Chapman & Hall/crc Numerical Analy & Scient Comp. Series) , 2007 .
[16] Zizhong Chen. Algorithm-based recovery for iterative methods without checkpointing , 2011, HPDC '11.
[17] Richard W. Vuduc,et al. Self-stabilizing iterative solvers , 2013, ScalA '13.
[18] Franck Cappello,et al. Optimization of a Multilevel Checkpoint Model with Uncertain Execution Scales , 2014, SC14: International Conference for High Performance Computing, Networking, Storage and Analysis.
[19] Olaf Schenk,et al. Inertia-Revealing Preconditioning For Large-Scale Nonconvex Constrained Optimization , 2008, SIAM J. Sci. Comput..
[20] Zizhong Chen,et al. Online Algorithm-Based Fault Tolerance for Cholesky Decomposition on Heterogeneous Systems with GPUs , 2016, 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS).
[21] Seung Woo Son,et al. NUMARCK: Machine Learning Algorithm for Resiliency and Checkpointing , 2014, SC14: International Conference for High Performance Computing, Networking, Storage and Analysis.
[22] Franck Cappello,et al. Fast Error-Bounded Lossy HPC Data Compression with SZ , 2016, 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS).
[23] Timothy A. Davis,et al. The university of Florida sparse matrix collection , 2011, TOMS.
[24] Ugo Becciani,et al. Solving a very large-scale sparse linear system with a parallel algorithm in the Gaia mission , 2014, 2014 International Conference on High Performance Computing & Simulation (HPCS).
[25] Pradip Bose,et al. Experience report: An application-specific checkpointing technique for minimizing checkpoint corruption , 2015, 2015 IEEE 26th International Symposium on Software Reliability Engineering (ISSRE).
[26] Y. Saad,et al. GMRES: a generalized minimal residual algorithm for solving nonsymmetric linear systems , 1986 .
[27] Jack Poulson,et al. Scientific computing , 2013, XRDS.
[28] Roy Friedman,et al. Starfish: Fault-Tolerant Dynamic MPI Programs on Clusters of Workstations , 1999, Proceedings. The Eighth International Symposium on High Performance Distributed Computing (Cat. No.99TH8469).
[29] Rajeev Thakur,et al. On implementing MPI-IO portably and with high performance , 1999, IOPADS '99.
[30] Martin Burtscher,et al. Fast lossless compression of scientific floating-point data , 2006, Data Compression Conference (DCC'06).
[31] Tamara G. Kolda,et al. Parallel Tensor Compression for Large-Scale Scientific Data , 2015, 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS).
[32] Peter Lindstrom,et al. Fixed-Rate Compressed Floating-Point Arrays , 2014, IEEE Transactions on Visualization and Computer Graphics.
[33] Franck Cappello,et al. Optimization of Multi-level Checkpoint Model for Large Scale HPC Applications , 2014, 2014 IEEE 28th International Parallel and Distributed Processing Symposium.
[34] Robert Latham,et al. ISABELA for effective in situ compression of scientific data , 2013, Concurr. Comput. Pract. Exp..
[35] Josue Mora Acosta,et al. Numerical algorithms for three dimensional computational fluid dynamic problems , 2001 .
[36] Bronis R. de Supinski,et al. MCREngine: A scalable checkpointing system using data-aware aggregation and compression , 2012, 2012 International Conference for High Performance Computing, Networking, Storage and Analysis.
[37] Martin Isenburg,et al. Fast and Efficient Compression of Floating-Point Data , 2006, IEEE Transactions on Visualization and Computer Graphics.
[38] Franck Cappello,et al. Exploring the feasibility of lossy compression for PDE simulations , 2019, Int. J. High Perform. Comput. Appl..
[39] A. Chorin. Numerical solution of the Navier-Stokes equations , 1968 .
[40] George Bosilca,et al. Recovery Patterns for Iterative Methods in a Parallel Unstable Environment , 2007, SIAM J. Sci. Comput..
[41] Zizhong Chen,et al. Online-ABFT: an online algorithm based fault tolerance scheme for soft error detection in iterative methods , 2013, PPoPP '13.
[42] Franck Cappello,et al. Adaptive Impact-Driven Detection of Silent Data Corruption for HPC Applications , 2016, IEEE Transactions on Parallel and Distributed Systems.
[43] William Gropp,et al. The blue waters super-system for super-science , 2013 .
[44] Zizhong Chen,et al. FT-ScaLAPACK: correcting soft errors on-line for ScaLAPACK cholesky, QR, and LU factorization routines , 2014, HPDC '14.
[45] Peter Deutsch,et al. GZIP file format specification version 4.3 , 1996, RFC.
[46] Emmanuel Agullo,et al. Towards resilient parallel linear Krylov solvers: recover-restart strategies , 2013 .
[47] Richard Barrett,et al. Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods , 1994, Other Titles in Applied Mathematics.
[48] William Gropp,et al. PETSc Users Manual Revision 3.4 , 2016 .
[49] Franck Cappello,et al. Significantly Improving Lossy Compression for Scientific Data Sets Based on Multidimensional Prediction and Error-Controlled Quantization , 2017, 2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS).
[50] Bronis R. de Supinski,et al. Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System , 2010, 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis.
[51] Duan Li,et al. On Restart Procedures for the Conjugate Gradient Method , 2004, Numerical Algorithms.
[52] Satoshi Matsuoka,et al. Exploration of Lossy Compression for Application-Level Checkpoint/Restart , 2015, 2015 IEEE International Parallel and Distributed Processing Symposium.