Online Algorithm-Based Fault Tolerance for Cholesky Decomposition on Heterogeneous Systems with GPUs
[1] Hui Liu,et al. Matrix Multiplication on GPUs with On-Line Fault Tolerance , 2011, 2011 IEEE Ninth International Symposium on Parallel and Distributed Processing with Applications.
[2] Jack J. Dongarra,et al. High Performance Dense Linear System Solver with Resilience to Multiple Soft Errors , 2012, ICCS.
[3] Zizhong Chen,et al. Algorithmic Cholesky factorization fault recovery , 2010, 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS).
[4] Zizhong Chen,et al. Algorithm-Based Fault Tolerance for Fail-Stop Failures , 2008, IEEE Transactions on Parallel and Distributed Systems.
[5] Claus Braun,et al. Efficient on-line fault-tolerance for the preconditioned conjugate gradient method , 2015, 2015 IEEE 21st International On-Line Testing Symposium (IOLTS).
[6] Thomas Hérault,et al. Algorithm-based fault tolerance for dense matrix factorizations , 2012, PPoPP '12.
[7] Zizhong Chen,et al. Fail-Stop Failure Algorithm-Based Fault Tolerance for Cholesky Decomposition , 2015, IEEE Transactions on Parallel and Distributed Systems.
[8] Zizhong Chen,et al. Online-ABFT: an online algorithm based fault tolerance scheme for soft error detection in iterative methods , 2013, PPoPP '13.
[9] Erlin Yao,et al. Detection of soft errors in LU decomposition with partial pivoting using algorithm-based fault tolerance , 2015 .
[10] George Bosilca,et al. CPU-GPU hybrid bidiagonal reduction with soft error resilience , 2013, ScalA '13.
[11] Padma Raghavan,et al. Fault tolerant preconditioned conjugate gradient for sparse linear system solution , 2012, ICS '12.
[12] Robert A. van de Geijn,et al. Fault-tolerant high-performance matrix multiplication: theory and practice , 2001, 2001 International Conference on Dependable Systems and Networks.
[13] Rakesh Kumar,et al. Algorithmic approaches to low overhead fault detection for sparse linear algebra , 2012, IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2012).
[14] Franck Cappello,et al. An Efficient Silent Data Corruption Detection Method with Error-Feedback Control and Even Sampling for HPC Applications , 2015, 2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing.
[15] Feng Gao,et al. Fault tolerant matrix-matrix multiplication: correcting soft errors on-line , 2011, ScalA '11.
[16] Jack J. Dongarra,et al. High Performance Dense Linear System Solver with Soft Error Resilience , 2011, 2011 IEEE International Conference on Cluster Computing.
[17] Y. Robert,et al. Fault-Tolerance Techniques for High-Performance Computing , 2015, Computer Communications and Networks.
[18] Frank Mueller,et al. Evaluating the Impact of SDC on the GMRES Iterative Solver , 2013, 2014 IEEE 28th International Parallel and Distributed Processing Symposium.
[19] Dingwen Tao,et al. Extending checksum-based ABFT to tolerate soft errors online in iterative methods , 2014, 2014 20th IEEE International Conference on Parallel and Distributed Systems (ICPADS).
[20] Satoshi Matsuoka,et al. Resilience in Exascale Computing (Dagstuhl Seminar 14402) , 2014, Dagstuhl Reports.
[21] Franck Cappello,et al. FTI: High performance Fault Tolerance Interface for hybrid systems , 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).
[22] Richard W. Vuduc,et al. Self-stabilizing iterative solvers , 2013, ScalA '13.
[23] Dong Li,et al. Rethinking algorithm-based fault tolerance with a cooperative software-hardware approach , 2013, 2013 SC - International Conference for High Performance Computing, Networking, Storage and Analysis (SC).
[24] Yves Robert,et al. Fault-Tolerance Techniques for High-Performance Computing , 2015 .
[25] Thomas Hérault,et al. Composing resilience techniques: ABFT, periodic and incremental checkpointing , 2015, Int. J. Netw. Comput..
[26] Vijay S. Pande,et al. Hard Data on Soft Errors: A Large-Scale Assessment of Real-World Error Rates in GPGPU , 2009, 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing.
[27] Zizhong Chen,et al. Correcting soft errors online in LU factorization , 2013, HPDC '13.
[28] Zizhong Chen,et al. FT-ScaLAPACK: correcting soft errors on-line for ScaLAPACK Cholesky, QR, and LU factorization routines , 2014, HPDC '14.
[29] Zizhong Chen,et al. On-line soft error correction in matrix-matrix multiplication , 2013, J. Comput. Sci..
[30] Giuseppe Di Fatta,et al. Epidemic Fault Tolerance for Extreme-Scale Parallel Computing , 2015, IDCS.
[31] Aurelien Bouteiller. Fault-Tolerant MPI , 2015 .
[32] Franck Cappello,et al. Lightweight Silent Data Corruption Detection Based on Runtime Data Analysis for HPC Applications , 2015, HPDC.
[33] Xin Fu,et al. Analyzing soft-error vulnerability on GPGPU microarchitecture , 2011, 2011 IEEE International Symposium on Workload Characterization (IISWC).
[34] Jacob A. Abraham,et al. Algorithm-Based Fault Tolerance for Matrix Operations , 1984, IEEE Transactions on Computers.
[35] Mingyu Chen,et al. Detection of soft errors in LU decomposition with partial pivoting using algorithm-based fault tolerance , 2015, Int. J. High Perform. Comput. Appl..
[36] Suku Nair,et al. Algorithm-Based Fault Tolerance on a Hypercube Multiprocessor , 1990, IEEE Trans. Computers.
[37] Thomas Hérault,et al. Algorithm-Based Fault Tolerance for Dense Matrix Factorizations, Multiple Failures and Accuracy , 2015, ACM Trans. Parallel Comput..