Computing the Expected Makespan of Task Graphs in the Presence of Silent Errors
[1] Eli Upfal,et al. Probability and Computing: Randomized Algorithms and Probabilistic Analysis , 2005 .
[2] Bharadwaj Veeravalli,et al. Combined DVFS and mapping exploration for lifetime and soft-error susceptibility improvement in MPSoCs , 2014, 2014 Design, Automation & Test in Europe Conference & Exhibition (DATE).
[3] Kurt B. Ferreira,et al. Fault-tolerant iterative methods via selective reliability , 2011 .
[4] Michael Pinedo,et al. Scheduling: Theory, Algorithms, and Systems , 1994 .
[5] Cédric Augonnet,et al. StarPU: a unified platform for task scheduling on heterogeneous multicore architectures , 2011, Concurr. Comput. Pract. Exp..
[6] Rolf H. Möhring,et al. Scheduling under Uncertainty: Bounding the Makespan Distribution , 2001, Computational Discrete Mathematics.
[7] Robert E. Lyons,et al. The Use of Triple-Modular Redundancy to Improve Computer Reliability , 1962, IBM J. Res. Dev..
[8] Rolf Riesen,et al. Detection and Correction of Silent Data Corruption for Large-Scale High-Performance Computing , 2012, 2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and Phd Forum.
[9] Jean-Charles Billaut,et al. Introduction to scheduling , 2002 .
[10] Padma Raghavan,et al. Fault tolerant preconditioned conjugate gradient for sparse linear system solution , 2012, ICS '12.
[11] J. Scott Provan,et al. The Complexity of Counting Cuts and of Computing the Probability that a Graph is Connected , 1983, SIAM J. Comput..
[12] Franck Cappello,et al. Detecting silent data corruption through data dynamic monitoring for scientific applications , 2014, PPoPP '14.
[13] Richard M. Van Slyke,et al. Letter to the Editor: Monte Carlo Methods and the PERT Problem , 1963 .
[14] Saurabh Gupta,et al. Lazy Checkpointing: Exploiting Temporal Locality in Failures to Mitigate Checkpointing Overheads on Extreme-Scale Systems , 2014, 2014 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks.
[15] Laxmikant V. Kalé,et al. ACR: Automatic checkpoint/restart for soft and hard error protection , 2013, 2013 SC - International Conference for High Performance Computing, Networking, Storage and Analysis (SC).
[16] Franck Cappello,et al. Detecting and Correcting Data Corruption in Stencil Applications through Multivariate Interpolation , 2015, 2015 IEEE International Conference on Cluster Computing.
[17] Pierre N. Robillard,et al. The Completion Time of PERT Networks , 1977, Oper. Res..
[18] Jaeyoung Choi,et al. Design and Implementation of the ScaLAPACK LU, QR, and Cholesky Factorization Routines , 1994, Sci. Program..
[19] Bin Nie,et al. A large-scale study of soft-errors on GPUs in the field , 2016, 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA).
[20] Radu Prodan,et al. Scheduling of scientific workflows in the ASKALON grid environment , 2005, SGMD.
[21] Van Slyke,et al. Monte Carlo Methods and the PERT Problem , 1963 .
[22] Ishfaq Ahmad,et al. Benchmarking and Comparison of the Task Graph Scheduling Algorithms , 1999, J. Parallel Distributed Comput..
[23] Xin-She Yang,et al. Introduction to Algorithms , 2021, Nature-Inspired Optimization Algorithms.
[24] Dakai Zhu,et al. Reliability-aware Dynamic Voltage Scaling for energy-constrained real-time embedded systems , 2008, 2008 IEEE International Conference on Computer Design.
[25] Ping Huang,et al. Power-Capping Aware Checkpointing: On the Interplay Among Power-Capping, Temperature, Reliability, Performance, and Energy , 2016, 2016 46th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN).
[26] Alan Wood,et al. The impact of new technology on soft error rates , 2011, 2011 International Reliability Physics Symposium.
[27] Bronis R. de Supinski,et al. Soft error vulnerability of iterative linear algebra methods , 2007, ICS '08.
[28] Saurabh Gupta,et al. Understanding and Exploiting Spatial Properties of System Failures on Extreme-Scale HPC Systems , 2015, 2015 45th Annual IEEE/IFIP International Conference on Dependable Systems and Networks.
[29] Zizhong Chen,et al. Online-ABFT: an online algorithm based fault tolerance scheme for soft error detection in iterative methods , 2013, PPoPP '13.
[30] Yves Robert,et al. Energy-aware scheduling under reliability and makespan constraints , 2011, 2012 19th International Conference on High Performance Computing.
[31] Frédéric Suter. Scheduling Delta-Critical Tasks in mixed-parallel applications on a national grid , 2007, GRID.
[32] Richard W. Vuduc,et al. Self-stabilizing iterative solvers , 2013, ScalA '13.
[33] Ian J. Taylor,et al. Workflows and e-Science: An overview of workflow system features and capabilities , 2009, Future Gener. Comput. Syst..
[34] Leslie G. Valiant,et al. The Complexity of Enumeration and Reliability Problems , 1979, SIAM J. Comput..
[35] Giuseppe Caire,et al. The throughput of hybrid-ARQ protocols for the Gaussian collision channel , 2001, IEEE Trans. Inf. Theory.
[36] Gary L. Miller,et al. Geometric mesh partitioning: implementation and experiments , 1995, Proceedings of 9th International Parallel Processing Symposium.
[37] Jacob A. Abraham,et al. Algorithm-Based Fault Tolerance for Matrix Operations , 1984, IEEE Transactions on Computers.
[38] George Bosilca,et al. Algorithm-based fault tolerance applied to high performance computing , 2009, J. Parallel Distributed Comput..
[39] Christian Engelmann,et al. Combining Partial Redundancy and Checkpointing for HPC , 2012, 2012 IEEE 32nd International Conference on Distributed Computing Systems.
[40] Eugene L. Lawler,et al. The recognition of Series Parallel digraphs , 1979, SIAM J. Comput..
[41] H. Bodlaender,et al. A Note on the Complexity of Network Reliability Problems , 2004 .
[42] Emmanuel Jeannot,et al. Correlation-Aware Heuristics for Evaluating the Distribution of the Longest Path Length of a DAG with Random Weights , 2016, IEEE Transactions on Parallel and Distributed Systems.
[43] John Shalf,et al. Memory Errors in Modern Systems: The Good, The Bad, and The Ugly , 2015, ASPLOS.
[44] Helmut Alt. Computational Discrete Mathematics: advanced lectures , 2001 .
[45] Jane N. Hagstrom,et al. Computational complexity of PERT problems , 1988, Networks.
[46] Y. Robert,et al. Fault-Tolerance Techniques for High-Performance Computing , 2015, Computer Communications and Networks.
[47] Austin R. Benson,et al. Silent error detection in numerical time-stepping schemes , 2015, Int. J. High Perform. Comput. Appl..
[48] C. E. Clark. The Greatest of a Finite Set of Random Variables , 1961 .
[49] Qiang Wu,et al. Revisiting Memory Errors in Large-Scale Production Data Centers: Analysis and Modeling of New Trends from the Field , 2015, 2015 45th Annual IEEE/IFIP International Conference on Dependable Systems and Networks.
[50] Franck Cappello,et al. Lightweight Silent Data Corruption Detection Based on Runtime Data Analysis for HPC Applications , 2015, HPDC.
[51] Franck Cappello,et al. Toward Exascale Resilience: 2014 update , 2014, Supercomput. Front. Innov..
[52] Bajis M. Dodin,et al. Bounding the Project Completion Time Distribution in PERT Networks , 1985, Oper. Res..
[53] D. Atkin. OR scheduling algorithms , 2000, Anesthesiology.
[54] Salim Hariri,et al. Performance-Effective and Low-Complexity Task Scheduling for Heterogeneous Computing , 2002, IEEE Trans. Parallel Distributed Syst..
[55] Rami G. Melhem,et al. The effects of energy management on reliability in real-time embedded systems , 2004, IEEE/ACM International Conference on Computer Aided Design, 2004. ICCAD-2004.