Grid Resource Availability Prediction-Based Scheduling and Task Replication

The frequent and volatile unavailability of volunteer-based Grid computing resources challenges Grid schedulers to make effective job placements. The manner in which host resources become unavailable will have different effects on different jobs, depending on their runtime and their ability to be checkpointed or replicated. A multi-state availability model can help improve scheduling performance by capturing the various ways a resource may be available or unavailable to the Grid. This paper uses a multi-state model and analyzes a machine availability trace in terms of that model. Several prediction techniques then forecast resource transitions into the model’s states. We analyze the accuracy of our predictors, which outperform existing approaches. We also propose and study several classes of schedulers that utilize the predictions, and a method for combining scheduling factors. We characterize the inherent tradeoff between job makespan and the number of evictions due to failure, and demonstrate how our schedulers can navigate this tradeoff under various scenarios. Lastly, we propose job replication techniques, which our schedulers utilize to replicate those jobs that are most likely to fail. Our replication strategies outperform others, as measured by improved makespan and fewer redundant operations. In particular, we define a new metric for replication efficiency, and demonstrate that our multi-state availability predictor can provide information that allows our schedulers to be more efficient than others that blindly replicate all jobs or some static percentage of jobs.
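As a rough illustration of the prediction-driven replication idea described above, the following sketch estimates from a host's state history how often it leaves an available state, and replicates only those jobs placed on hosts whose estimated risk exceeds a threshold. It is not the paper's predictor or scheduler: the state names (`AVAILABLE`, `USER_PRESENT`, `DOWN`), the frequency-counting estimate, the function names, and the `threshold` value are all assumptions made for the example.

```python
from collections import Counter


def predict_unavailability(history, available_state="AVAILABLE"):
    """Estimate how often a host leaves the available state, per target state.

    `history` is a list of state labels, one per observation interval.
    The labels are hypothetical; the abstract does not name the model's states.
    """
    transitions = Counter()
    stays = 0
    for current, following in zip(history, history[1:]):
        if current != available_state:
            continue  # only transitions out of the available state matter here
        if following == available_state:
            stays += 1
        else:
            transitions[following] += 1
    total = stays + sum(transitions.values())
    if total == 0:
        return {}  # no observations of the available state
    return {state: count / total for state, count in transitions.items()}


def choose_jobs_to_replicate(placements, host_histories, threshold=0.3):
    """Return the jobs whose host has an estimated risk of at least `threshold`.

    `placements` maps job name to host name. The threshold is an assumed
    tuning knob, not a value taken from the paper.
    """
    selected = []
    for job, host in placements.items():
        risk = sum(predict_unavailability(host_histories[host]).values())
        if risk >= threshold:
            selected.append(job)
    return selected


if __name__ == "__main__":
    histories = {
        "hostA": ["AVAILABLE", "AVAILABLE", "USER_PRESENT", "AVAILABLE", "DOWN"],
        "hostB": ["AVAILABLE"] * 5,
    }
    placements = {"job1": "hostA", "job2": "hostB"}
    print(choose_jobs_to_replicate(placements, histories))
```

Unlike replicating every job or a fixed percentage of jobs, this selection depends on each host's observed behavior, so a job on a stable host is not replicated at all.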
