Multihybrid job scheduling for fault-tolerant distributed computing in policy-constrained resource networks

Unpredictable fluctuations in resource availability often lead to rescheduling decisions that sacrifice the job completion success rate in batch job scheduling. To overcome this limitation, we consider the problem of assigning a set of sequential batch jobs with demands to a set of resources subject to constraints such as heterogeneous rescheduling policies and capabilities. The goal is to find an optimal allocation that maximizes the performance benefits in terms of makespan and utilization according to the principle of Pareto optimality, while keeping the job failure rate close to an acceptably low bound. To this end, we formulate a multihybrid policy decision problem (MPDP) on the primary-backup fault tolerance model and show that it is NP-complete. Our main contribution is a proof that the proposed multihybrid job scheduling (MJS) scheme guarantees fault-tolerant performance by adaptively combining jobs and resources with different rescheduling policies in MPDP. Furthermore, through extensive simulations under various scheduling conditions, we demonstrate that the MJS scheme outperforms five rescheduling heuristics in solution quality, search adaptability, and time efficiency.
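To make the optimization target concrete, the following is a minimal sketch of Pareto filtering over candidate allocations under a failure-rate bound. It is not the MJS scheme or the MPDP formulation; the `Allocation` type, the `dominates` and `pareto_front` helpers, and all numeric values are hypothetical and chosen only for illustration.

```python
from typing import List, NamedTuple


class Allocation(NamedTuple):
    """Hypothetical summary of one candidate job-to-resource allocation."""
    name: str
    makespan: float      # lower is better
    utilization: float   # higher is better
    failure_rate: float  # must not exceed the acceptable bound


def dominates(a: Allocation, b: Allocation) -> bool:
    """True if a is no worse than b on both objectives and better on one."""
    no_worse = a.makespan <= b.makespan and a.utilization >= b.utilization
    strictly_better = a.makespan < b.makespan or a.utilization > b.utilization
    return no_worse and strictly_better


def pareto_front(candidates: List[Allocation],
                 max_failure_rate: float) -> List[Allocation]:
    """Keep feasible candidates that no other feasible candidate dominates."""
    feasible = [c for c in candidates if c.failure_rate <= max_failure_rate]
    return [c for c in feasible
            if not any(dominates(other, c) for other in feasible)]


# Illustrative values only (assumed, not taken from the paper).
candidates = [
    Allocation("A", makespan=100.0, utilization=0.80, failure_rate=0.02),
    Allocation("B", makespan=90.0, utilization=0.70, failure_rate=0.03),
    Allocation("C", makespan=110.0, utilization=0.75, failure_rate=0.01),
    Allocation("D", makespan=80.0, utilization=0.90, failure_rate=0.20),
]

front = pareto_front(candidates, max_failure_rate=0.05)
print([c.name for c in front])
```

In this sketch, D is excluded because it exceeds the assumed failure-rate bound, and C is dominated by A, so A and B remain as incomparable trade-offs between makespan and utilization.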
