DGSS: A Dependability Guided Job Scheduling System for Grid Environment

Because of the diverse failures and error conditions in grid environments, node unavailability is becoming increasingly severe and poses a great challenge to reliable job scheduling. Current job management systems rely mainly on fault recovery mechanisms to guarantee that jobs complete, but at the cost of system efficiency. To address this challenge, this paper designs a node TTF (Time To Failure) prediction model and a job completion prediction model. Based on these models, the paper proposes a dependability guided job scheduling system, called DGSS, which provides failure-avoidance job scheduling. Experimental results show improvements in both the dependability of job execution and the utilization of system resources.
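
The idea of failure-avoidance scheduling can be illustrated with a minimal sketch. This is not the DGSS implementation, and the abstract does not specify how the two prediction models are computed or combined. The sketch assumes that each node already carries a predicted TTF and that a job's completion time on a node can be estimated as work divided by node speed. The names `Node`, `predicted_completion`, `pick_node`, and the `margin` parameter are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Node:
    name: str
    predicted_ttf: float  # assumed output of a TTF prediction model, in seconds
    speed: float          # assumed processing rate, in work units per second (> 0)


def predicted_completion(work: float, node: Node) -> float:
    """Stand-in for a job completion prediction model (assumption: work / speed)."""
    return work / node.speed


def pick_node(work: float, nodes: Sequence[Node], margin: float = 1.0) -> Optional[Node]:
    """Return the fastest node expected to outlive the job, or None if none qualifies.

    A node qualifies when its predicted TTF is at least `margin` times the
    predicted completion time of the job on that node.
    """
    safe = [
        n for n in nodes
        if n.predicted_ttf >= margin * predicted_completion(work, n)
    ]
    if not safe:
        return None
    return min(safe, key=lambda n: predicted_completion(work, n))
```

Under these assumptions, a fast node that is predicted to fail before the job finishes is skipped in favor of a slower node that is predicted to stay up, which avoids the failure instead of recovering from it afterwards.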
