Design and Comparison of Resilient Scheduling Heuristics for Parallel Jobs

This paper addresses the resilient scheduling of parallel jobs on high-performance computing (HPC) platforms, with the objective of minimizing the overall completion time, or makespan. We revisit the classical problem under the assumption that jobs are subject to transient or silent errors, and hence may need to be re-executed each time they fail to complete successfully. This work generalizes the classical framework, where jobs are known offline and do not fail. In that framework, list scheduling that gives priority to the longest jobs is known to be a 3-approximation when restricted to shelves, and a 2-approximation without this restriction. We show that when jobs can fail, using shelves can be arbitrarily bad, but unrestricted list scheduling remains a 2-approximation. We then design several heuristics, some list-based and some shelf-based, combined with different priority rules and backfilling strategies. We assess and compare their performance through an extensive set of simulations, using both synthetic jobs and log traces from the Mira supercomputer.
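To make the setting concrete, the following is a minimal sketch of greedy list scheduling with longest-job-first priority and re-execution on failure. It is an illustration under stated assumptions, not the heuristics evaluated in the paper: the function name `list_schedule`, the `(procs, time)` job representation, and the failure model (each attempt fails independently with a fixed probability `fail_prob`, and the error is detected only when the attempt ends) are all hypothetical choices made for this example.

```python
import heapq
import random


def list_schedule(jobs, num_procs, fail_prob=0.0, seed=0):
    """Simulate greedy list scheduling of rigid parallel jobs.

    jobs      -- list of (procs, time) pairs: a job occupies `procs`
                 processors for `time` units per attempt.
    num_procs -- number of identical processors on the platform.
    fail_prob -- assumed probability that one attempt fails (illustrative
                 model only; must satisfy 0 <= fail_prob < 1).
    Returns the makespan of the simulated schedule.
    """
    if not 0.0 <= fail_prob < 1.0:
        raise ValueError("fail_prob must be in [0, 1)")
    if any(procs > num_procs for procs, _ in jobs):
        raise ValueError("a job requests more processors than available")

    rng = random.Random(seed)
    # Priority rule: longest execution time first.
    pending = sorted(jobs, key=lambda job: -job[1])
    running = []  # min-heap of (finish_time, procs, time)
    free = num_procs
    now = 0.0

    while pending or running:
        # Scan the list in priority order and start every job that fits.
        waiting = []
        for procs, time in pending:
            if procs <= free:
                free -= procs
                heapq.heappush(running, (now + time, procs, time))
            else:
                waiting.append((procs, time))
        pending = waiting

        # Advance the clock to the next completion and release processors.
        now, procs, time = heapq.heappop(running)
        free += procs

        # Assumed failure model: the error is detected at the end of the
        # attempt, and the whole job is put back in the list.
        if rng.random() < fail_prob:
            pending.append((procs, time))
            pending.sort(key=lambda job: -job[1])

    # The last event processed is always a successful completion.
    return now
```

For example, with `jobs = [(2, 3.0), (2, 3.0), (4, 1.0)]` on 4 processors and no failures, the two 3-unit jobs run side by side and the 4-processor job follows, for a makespan of 4.0. A shelf-based variant would additionally force jobs to start only at shelf boundaries, which this sketch does not model.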
