Pro-active failure handling mechanisms for scheduling in grid computing environments

In this paper, we design pro-active failure handling strategies for grid environments. These strategies estimate the availability of resources in the Grid and calculate, in advance, its expected long-term capacity. Using them, we create modified versions of the backfill and replication algorithms that incorporate all three pro-active strategies, in order to assess how effective each is at preventing job failures during execution. We also extend our earlier work on a co-ordinate based allocation strategy; the extended algorithm likewise shows continual improvement when operating under the same execution environment. In our experiments, we compare these enhanced algorithms to their original forms and show that pro-active failure handling can, in some cases, avoid all job failures during execution. We also show that, among the algorithms we have considered, NSA provides the best balance between enhanced throughput and job failures during execution.
