Computing the Expected Makespan of Task Graphs in the Presence of Silent Errors

Applications structured as Directed Acyclic Graphs (DAGs) of tasks correspond to a general model of parallel computation that occurs in many domains, including popular scientific workflows. DAG scheduling has received an enormous amount of attention, and several list-scheduling heuristics have been proposed and shown to be effective in practice. Many of these heuristics make scheduling decisions based on path lengths in the DAG. At large scale, however, compute platforms and thus tasks are subject to various types of failures with no longer negligible probabilities of occurrence. Failures that have recently received increasing attention are "silent errors," which cause a task to produce incorrect results even though it ran to completion. Tolerating silent errors is done by checking the validity of the results and re-executing the task from scratch in case of an invalid result. The execution time of a task then becomes a random variable, and so are path lengths. Unfortunately, computing the expected makespan of a DAG (and equivalently computing expected path lengths in a DAG) is a computationally difficult problem. Consequently, designing effective scheduling heuristics is preconditioned on computing accurate approximations of the expected makespan. In this work we propose an algorithm that computes a first order approximation of the expected makespan of a DAG when tasks are subject to silent errors. We compare our proposed approximation to previously proposed such approximations for three classes of application graphs from the field of numerical linear algebra. Our evaluations quantify approximation error with respect to a ground truth computed via a brute-force Monte Carlo method. We find that our proposed approximation outperforms previously proposed approaches, leading to large reductions in approximation error for low (and realistic) failure rates, while executing much faster.

[1]  Gary L. Miller,et al.  Geometric mesh partitioning: implementation and experiments , 1995, Proceedings of 9th International Parallel Processing Symposium.

[2]  Yves Robert,et al.  Energy-aware scheduling under reliability and makespan constraints , 2011, 2012 19th International Conference on High Performance Computing.

[3]  Emmanuel Jeannot,et al.  Correlation-Aware Heuristics for Evaluating the Distribution of the Longest Path Length of a DAG with Random Weights , 2016, IEEE Transactions on Parallel and Distributed Systems.

[4]  Jean-Charles Billaut,et al.  Introduction to scheduling , 2002 .

[5]  Bronis R. de Supinski,et al.  Soft error vulnerability of iterative linear algebra methods , 2007, ICS '08.

[6]  Alan Wood,et al.  The impact of new technology on soft error rates , 2011, 2011 International Reliability Physics Symposium.

[7]  Saurabh Gupta,et al.  Understanding and Exploiting Spatial Properties of System Failures on Extreme-Scale HPC Systems , 2015, 2015 45th Annual IEEE/IFIP International Conference on Dependable Systems and Networks.

[8]  Salim Hariri,et al.  Performance-Effective and Low-Complexity Task Scheduling for Heterogeneous Computing , 2002, IEEE Trans. Parallel Distributed Syst..

[9]  Frédéric Suter Scheduling Delta-Critical Tasks in mixed-parallel applications on a national grid , 2007, GRID.

[10]  Van Slyke,et al.  MONTE CARLO METHODS AND THE PERT PROBLEM , 1963 .

[11]  Rolf H. Möhring,et al.  Scheduling under Uncertainty: Bounding the Makespan Distribution , 2001, Computational Discrete Mathematics.

[12]  Dakai Zhu,et al.  Reliability-aware Dynamic Voltage Scaling for energy-constrained real-time embedded systems , 2008, 2008 IEEE International Conference on Computer Design.

[13]  Bajis M. Dodin,et al.  Bounding the Project Completion Time Distribution in PERT Networks , 1985, Oper. Res..

[14]  Christian Engelmann,et al.  Combining Partial Redundancy and Checkpointing for HPC , 2012, 2012 IEEE 32nd International Conference on Distributed Computing Systems.

[15]  Y. Robert,et al.  Fault-Tolerance Techniques for High-Performance Computing , 2015, Computer Communications and Networks.

[16]  Padma Raghavan,et al.  Fault tolerant preconditioned conjugate gradient for sparse linear system solution , 2012, ICS '12.

[17]  Giuseppe Caire,et al.  The throughput of hybrid-ARQ protocols for the Gaussian collision channel , 2001, IEEE Trans. Inf. Theory.

[18]  C. E. Clark The Greatest of a Finite Set of Random Variables , 1961 .

[19]  H. Bodlaender,et al.  A Note on the Complexity of Network Reliability Problems , 2004 .

[20]  Richard M. Van Slyke,et al.  Letter to the Editor---Monte Carlo Methods and the PERT Problem , 1963 .

[21]  Jacob A. Abraham,et al.  Algorithm-Based Fault Tolerance for Matrix Operations , 1984, IEEE Transactions on Computers.

[22]  Bharadwaj Veeravalli,et al.  Combined DVFS and mapping exploration for lifetime and soft-error susceptibility improvement in MPSoCs , 2014, 2014 Design, Automation & Test in Europe Conference & Exhibition (DATE).

[23]  Rolf Riesen,et al.  Detection and Correction of Silent Data Corruption for Large-Scale High-Performance Computing , 2012, 2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and Phd Forum.

[24]  Eli Upfal,et al.  Probability and Computing: Randomized Algorithms and Probabilistic Analysis , 2005 .

[25]  Zizhong Chen,et al.  Online-ABFT: an online algorithm based fault tolerance scheme for soft error detection in iterative methods , 2013, PPoPP '13.

[26]  Helmut Alt Computational Discrete Mathematics: advanced lectures , 2001 .

[27]  George Bosilca,et al.  Algorithm-based fault tolerance applied to high performance computing , 2009, J. Parallel Distributed Comput..

[28]  Radu Prodan,et al.  Scheduling of scientific workflows in the ASKALON grid environment , 2005, SGMD.

[29]  Richard W. Vuduc,et al.  Self-stabilizing iterative solvers , 2013, ScalA '13.

[30]  Franck Cappello,et al.  Toward Exascale Resilience: 2014 update , 2014, Supercomput. Front. Innov..

[31]  Qiang Wu,et al.  Revisiting Memory Errors in Large-Scale Production Data Centers: Analysis and Modeling of New Trends from the Field , 2015, 2015 45th Annual IEEE/IFIP International Conference on Dependable Systems and Networks.

[32]  Kurt B. Ferreira,et al.  Fault-tolerant iterative methods via selective reliability. , 2011 .

[33]  Pierre N. Robillard,et al.  The Completion Time of PERT Networks , 1977, Oper. Res..

[34]  J. Scott Provan,et al.  The Complexity of Counting Cuts and of Computing the Probability that a Graph is Connected , 1983, SIAM J. Comput..

[35]  Ishfaq Ahmad,et al.  Benchmarking and Comparison of the Task Graph Scheduling Algorithms , 1999, J. Parallel Distributed Comput..

[36]  Jane N. Hagstrom,et al.  Computational complexity of PERT problems , 1988, Networks.

[37]  Cédric Augonnet,et al.  StarPU: a unified platform for task scheduling on heterogeneous multicore architectures , 2011, Concurr. Comput. Pract. Exp..

[38]  Bin Nie,et al.  A large-scale study of soft-errors on GPUs in the field , 2016, 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA).

[39]  Ping Huang,et al.  Power-Capping Aware Checkpointing: On the Interplay Among Power-Capping, Temperature, Reliability, Performance, and Energy , 2016, 2016 46th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN).

[40]  Ian J. Taylor,et al.  Workflows and e-Science: An overview of workflow system features and capabilities , 2009, Future Gener. Comput. Syst..

[41]  Franck Cappello,et al.  Detecting silent data corruption through data dynamic monitoring for scientific applications , 2014, PPoPP '14.

[42]  Austin R. Benson,et al.  Silent error detection in numerical time-stepping schemes , 2015, Int. J. High Perform. Comput. Appl..

[43]  Rami G. Melhem,et al.  The effects of energy management on reliability in real-time embedded systems , 2004, IEEE/ACM International Conference on Computer Aided Design, 2004. ICCAD-2004..

[44]  Michael Pinedo,et al.  Scheduling: Theory, Algorithms, and Systems , 1994 .

[45]  Franck Cappello,et al.  Detecting and Correcting Data Corruption in Stencil Applications through Multivariate Interpolation , 2015, 2015 IEEE International Conference on Cluster Computing.

[46]  Peter Gluchowski,et al.  F , 1934, The Herodotus Encyclopedia.

[47]  Laxmikant V. Kalé,et al.  ACR: Automatic checkpoint/restart for soft and hard error protection , 2013, 2013 SC - International Conference for High Performance Computing, Networking, Storage and Analysis (SC).

[48]  Saurabh Gupta,et al.  Lazy Checkpointing: Exploiting Temporal Locality in Failures to Mitigate Checkpointing Overheads on Extreme-Scale Systems , 2014, 2014 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks.

[49]  Eugene L. Lawler,et al.  The recognition of Series Parallel digraphs , 1979, SIAM J. Comput..

[50]  Leslie G. Valiant,et al.  The Complexity of Enumeration and Reliability Problems , 1979, SIAM J. Comput..

[51]  Franck Cappello,et al.  Lightweight Silent Data Corruption Detection Based on Runtime Data Analysis for HPC Applications , 2015, HPDC.

[52]  Robert E. Lyons,et al.  The Use of Triple-Modular Redundancy to Improve Computer Reliability , 1962, IBM J. Res. Dev..

[53]  John Shalf,et al.  Memory Errors in Modern Systems: The Good, The Bad, and The Ugly , 2015, ASPLOS.

[54]  D. Atkin OR scheduling algorithms. , 2000, Anesthesiology.

[55]  Jaeyoung Choi,et al.  Design and Implementation of the ScaLAPACK LU, QR, and Cholesky Factorization Routines , 1994, Sci. Program..