Computing the Expected Makespan of Task Graphs in the Presence of Silent Errors
[1] Gary L. Miller,et al. Geometric mesh partitioning: implementation and experiments , 1995, Proceedings of 9th International Parallel Processing Symposium.
[2] Yves Robert,et al. Energy-aware scheduling under reliability and makespan constraints , 2011, 2012 19th International Conference on High Performance Computing.
[3] Emmanuel Jeannot,et al. Correlation-Aware Heuristics for Evaluating the Distribution of the Longest Path Length of a DAG with Random Weights , 2016, IEEE Transactions on Parallel and Distributed Systems.
[4] Jean-Charles Billaut,et al. Introduction to scheduling , 2002 .
[5] Bronis R. de Supinski,et al. Soft error vulnerability of iterative linear algebra methods , 2007, ICS '08.
[6] Alan Wood,et al. The impact of new technology on soft error rates , 2011, 2011 International Reliability Physics Symposium.
[7] Saurabh Gupta,et al. Understanding and Exploiting Spatial Properties of System Failures on Extreme-Scale HPC Systems , 2015, 2015 45th Annual IEEE/IFIP International Conference on Dependable Systems and Networks.
[8] Salim Hariri,et al. Performance-Effective and Low-Complexity Task Scheduling for Heterogeneous Computing , 2002, IEEE Trans. Parallel Distributed Syst.
[9] Frédéric Suter. Scheduling Delta-Critical Tasks in mixed-parallel applications on a national grid , 2007, GRID.
[10] Van Slyke,et al. Monte Carlo Methods and the PERT Problem , 1963 .
[11] Rolf H. Möhring,et al. Scheduling under Uncertainty: Bounding the Makespan Distribution , 2001, Computational Discrete Mathematics.
[12] Dakai Zhu,et al. Reliability-aware Dynamic Voltage Scaling for energy-constrained real-time embedded systems , 2008, 2008 IEEE International Conference on Computer Design.
[13] Bajis M. Dodin,et al. Bounding the Project Completion Time Distribution in PERT Networks , 1985, Oper. Res.
[14] Christian Engelmann,et al. Combining Partial Redundancy and Checkpointing for HPC , 2012, 2012 IEEE 32nd International Conference on Distributed Computing Systems.
[15] Y. Robert,et al. Fault-Tolerance Techniques for High-Performance Computing , 2015, Computer Communications and Networks.
[16] Padma Raghavan,et al. Fault tolerant preconditioned conjugate gradient for sparse linear system solution , 2012, ICS '12.
[17] Giuseppe Caire,et al. The throughput of hybrid-ARQ protocols for the Gaussian collision channel , 2001, IEEE Trans. Inf. Theory.
[18] C. E. Clark. The Greatest of a Finite Set of Random Variables , 1961 .
[19] H. Bodlaender,et al. A Note on the Complexity of Network Reliability Problems , 2004 .
[20] Richard M. Van Slyke,et al. Letter to the Editor---Monte Carlo Methods and the PERT Problem , 1963 .
[21] Jacob A. Abraham,et al. Algorithm-Based Fault Tolerance for Matrix Operations , 1984, IEEE Transactions on Computers.
[22] Bharadwaj Veeravalli,et al. Combined DVFS and mapping exploration for lifetime and soft-error susceptibility improvement in MPSoCs , 2014, 2014 Design, Automation & Test in Europe Conference & Exhibition (DATE).
[23] Rolf Riesen,et al. Detection and Correction of Silent Data Corruption for Large-Scale High-Performance Computing , 2012, 2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and Phd Forum.
[24] Eli Upfal,et al. Probability and Computing: Randomized Algorithms and Probabilistic Analysis , 2005 .
[25] Zizhong Chen,et al. Online-ABFT: an online algorithm based fault tolerance scheme for soft error detection in iterative methods , 2013, PPoPP '13.
[26] Helmut Alt. Computational Discrete Mathematics: advanced lectures , 2001 .
[27] George Bosilca,et al. Algorithm-based fault tolerance applied to high performance computing , 2009, J. Parallel Distributed Comput.
[28] Radu Prodan,et al. Scheduling of scientific workflows in the ASKALON grid environment , 2005, SGMD.
[29] Richard W. Vuduc,et al. Self-stabilizing iterative solvers , 2013, ScalA '13.
[30] Franck Cappello,et al. Toward Exascale Resilience: 2014 update , 2014, Supercomput. Front. Innov.
[31] Qiang Wu,et al. Revisiting Memory Errors in Large-Scale Production Data Centers: Analysis and Modeling of New Trends from the Field , 2015, 2015 45th Annual IEEE/IFIP International Conference on Dependable Systems and Networks.
[32] Kurt B. Ferreira,et al. Fault-tolerant iterative methods via selective reliability , 2011 .
[33] Pierre N. Robillard,et al. The Completion Time of PERT Networks , 1977, Oper. Res.
[34] J. Scott Provan,et al. The Complexity of Counting Cuts and of Computing the Probability that a Graph is Connected , 1983, SIAM J. Comput.
[35] Ishfaq Ahmad,et al. Benchmarking and Comparison of the Task Graph Scheduling Algorithms , 1999, J. Parallel Distributed Comput.
[36] Jane N. Hagstrom,et al. Computational complexity of PERT problems , 1988, Networks.
[37] Cédric Augonnet,et al. StarPU: a unified platform for task scheduling on heterogeneous multicore architectures , 2011, Concurr. Comput. Pract. Exp.
[38] Bin Nie,et al. A large-scale study of soft-errors on GPUs in the field , 2016, 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA).
[39] Ping Huang,et al. Power-Capping Aware Checkpointing: On the Interplay Among Power-Capping, Temperature, Reliability, Performance, and Energy , 2016, 2016 46th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN).
[40] Ian J. Taylor,et al. Workflows and e-Science: An overview of workflow system features and capabilities , 2009, Future Gener. Comput. Syst.
[41] Franck Cappello,et al. Detecting silent data corruption through data dynamic monitoring for scientific applications , 2014, PPoPP '14.
[42] Austin R. Benson,et al. Silent error detection in numerical time-stepping schemes , 2015, Int. J. High Perform. Comput. Appl.
[43] Rami G. Melhem,et al. The effects of energy management on reliability in real-time embedded systems , 2004, IEEE/ACM International Conference on Computer Aided Design, 2004. ICCAD-2004.
[44] Michael Pinedo,et al. Scheduling: Theory, Algorithms, and Systems , 1994 .
[45] Franck Cappello,et al. Detecting and Correcting Data Corruption in Stencil Applications through Multivariate Interpolation , 2015, 2015 IEEE International Conference on Cluster Computing.
[46] Peter Gluchowski,et al. F , 1934, The Herodotus Encyclopedia.
[47] Laxmikant V. Kalé,et al. ACR: Automatic checkpoint/restart for soft and hard error protection , 2013, 2013 SC - International Conference for High Performance Computing, Networking, Storage and Analysis (SC).
[48] Saurabh Gupta,et al. Lazy Checkpointing: Exploiting Temporal Locality in Failures to Mitigate Checkpointing Overheads on Extreme-Scale Systems , 2014, 2014 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks.
[49] Eugene L. Lawler,et al. The recognition of Series Parallel digraphs , 1979, SIAM J. Comput.
[50] Leslie G. Valiant,et al. The Complexity of Enumeration and Reliability Problems , 1979, SIAM J. Comput.
[51] Franck Cappello,et al. Lightweight Silent Data Corruption Detection Based on Runtime Data Analysis for HPC Applications , 2015, HPDC.
[52] Robert E. Lyons,et al. The Use of Triple-Modular Redundancy to Improve Computer Reliability , 1962, IBM J. Res. Dev.
[53] John Shalf,et al. Memory Errors in Modern Systems: The Good, The Bad, and The Ugly , 2015, ASPLOS.
[54] D. Atkin. OR scheduling algorithms , 2000, Anesthesiology.
[55] Jaeyoung Choi,et al. Design and Implementation of the ScaLAPACK LU, QR, and Cholesky Factorization Routines , 1994, Sci. Program.