Managing Failures in a Grid System using FailRank ∗

The objective of Grid computing is to make processing power as accessible and easy to use as electricity and water. The last decade has seen unprecedented growth in Grid infrastructures, which nowadays enable large-scale deployment of applications in the scientific computation domain. One of the main challenges in realizing the full potential of Grids is to make these systems dependable. The objective of this paper is threefold. Firstly, we acquaint the dependability community with the challenges of realizing a dependable Grid. Secondly, we identify the causes of failures in Grid jobs; our findings are extrapolated from the experience we acquired by operating a 72-CPU site of the EGEE Grid, one of the largest Grids in scientific computing, over a span of two years. Lastly, we describe the FailRank architecture, a novel framework for integrating and ranking information sources that characterize failures in a Grid system. We identify challenges and preliminary solutions for a variety of complementary tasks, including exploratory data analysis and prediction.

Technical Report TR-06-4, Department of Computer Science, University of Cyprus, September 2006.

∗ This work was supported in part by the European Union under projects EGEE (#IST-2003-508833) and CoreGRID (#IST-2002-004265).
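To make the ranking idea concrete, the following is a minimal sketch, not the paper's FailRank implementation, of how failure-related monitoring feeds might be integrated into a single score per site and used to retrieve the top-k most failure-prone sites. All names, features, and weights (rank_sites, failed_job_ratio, etc.) are hypothetical illustrations introduced here for exposition only.

    # Minimal sketch (assumed, not the paper's implementation): rank grid sites by an
    # aggregate "failure score" computed from monitored feature vectors, then return
    # the top-k most failure-prone sites for inspection.
    from typing import Dict, List, Tuple

    # Hypothetical failure-related features per site, each normalized to [0, 1]
    # (e.g., fraction of failed jobs, fraction of failed monitoring probes, queue load).
    SiteFeatures = Dict[str, float]

    # Hypothetical weights expressing how strongly each feature indicates failure.
    WEIGHTS: Dict[str, float] = {
        "failed_job_ratio": 0.5,
        "failed_probe_ratio": 0.3,
        "queue_load": 0.2,
    }

    def failure_score(features: SiteFeatures) -> float:
        """Weighted linear aggregation of a site's failure-related features."""
        return sum(WEIGHTS[name] * features.get(name, 0.0) for name in WEIGHTS)

    def rank_sites(sites: Dict[str, SiteFeatures], k: int) -> List[Tuple[str, float]]:
        """Return the k sites with the highest aggregate failure score."""
        scored = [(site, failure_score(feats)) for site, feats in sites.items()]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:k]

    if __name__ == "__main__":
        # Toy monitoring snapshot for three hypothetical sites.
        snapshot = {
            "site-A": {"failed_job_ratio": 0.40, "failed_probe_ratio": 0.10, "queue_load": 0.70},
            "site-B": {"failed_job_ratio": 0.05, "failed_probe_ratio": 0.00, "queue_load": 0.30},
            "site-C": {"failed_job_ratio": 0.25, "failed_probe_ratio": 0.60, "queue_load": 0.20},
        }
        for site, score in rank_sites(snapshot, k=2):
            print(f"{site}: {score:.2f}")

A linear weighted aggregation is chosen here purely for simplicity; any scoring function that orders sites by likelihood of failure could be substituted, and the top-k retrieval step could equally be served by a threshold-based query.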
