An improved ant colony optimization algorithm with fault tolerance for job scheduling in grid computing systems

The Grid scheduler assigns user jobs to the best available resource, judged by resource characteristics, so as to minimize job execution time. Resource failure in the Grid is no longer an exception but a regularly occurring event, because resources are increasingly used by the scientific community to solve computationally intensive problems that typically run for days or even months. It is therefore essential that these long-running applications tolerate failures and avoid recomputation from scratch after a resource failure, so that the user’s Quality of Service (QoS) requirement is satisfied. This paper proposes a job scheduling algorithm with fault tolerance for Grid computing, based on Ant Colony Optimization (ACO), to ensure that jobs execute successfully even when resource failure occurs. The technique combines the resource failure rate with a checkpoint-based rollback recovery strategy. Checkpointing aims to reduce the amount of work lost upon failure of the system by immediately saving the state of the system. The proposed approach is compared with an existing ACO algorithm that has no integrated fault tolerance. Experimental results show that the fault-tolerant scheduling algorithm improves on the existing ACO algorithm in meeting the user’s QoS requirement. The performance of the two algorithms was evaluated using three main scheduling performance metrics: makespan, throughput, and average turnaround time.
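The benefit of checkpoint-based rollback recovery and the three evaluation metrics can be illustrated with a minimal Python sketch. It is not the paper's implementation: the names `JobRecord`, `work_lost`, `makespan`, `throughput`, and `average_turnaround` are hypothetical, the fixed checkpoint interval is an assumption, and the metric definitions are the commonly used ones, which the abstract does not spell out.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class JobRecord:
    """Hypothetical record of one completed job (times in arbitrary units)."""
    submit_time: float
    finish_time: float


def work_lost(failure_time: float, checkpoint_interval: Optional[float]) -> float:
    """Work that must be redone when a resource fails `failure_time` units
    into a job. Assumes state is saved every `checkpoint_interval` units;
    with no checkpointing (None) the job restarts from scratch."""
    if checkpoint_interval is None:
        return failure_time
    # Roll back to the most recent checkpoint; only the tail is lost.
    return failure_time % checkpoint_interval


def makespan(jobs: List[JobRecord]) -> float:
    """Assumed definition: earliest submission to latest completion."""
    return max(j.finish_time for j in jobs) - min(j.submit_time for j in jobs)


def throughput(jobs: List[JobRecord]) -> float:
    """Assumed definition: jobs completed per unit of makespan."""
    return len(jobs) / makespan(jobs)


def average_turnaround(jobs: List[JobRecord]) -> float:
    """Assumed definition: mean of (finish time - submit time) over all jobs."""
    return sum(j.finish_time - j.submit_time for j in jobs) / len(jobs)


if __name__ == "__main__":
    jobs = [JobRecord(0.0, 10.0), JobRecord(0.0, 20.0), JobRecord(5.0, 25.0)]
    print("lost without checkpoints:", work_lost(95.0, None))
    print("lost with interval 30:", work_lost(95.0, 30.0))
    print("makespan:", makespan(jobs))
    print("throughput:", throughput(jobs))
    print("average turnaround:", average_turnaround(jobs))
```

Under these assumptions, a failure 95 time units into a job costs 95 units of recomputation without checkpointing but only 5 units with a checkpoint every 30 units, which is the kind of saving that shortens makespan and turnaround time when failures are frequent.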
