activity incidents on distributed computing infrastructures

Distributed computing infrastructures are commonly used through scientific gateways, but operating these gateways requires significant human intervention to handle operational incidents. This paper presents a self-healing process that quantifies incident degrees of workflow activities from metrics measuring the long-tail effect, application efficiency, data transfer issues, and site-specific problems. These metrics are simple enough to be computed online, and they make few assumptions about the application or resource characteristics. Based on their degree, incidents are classified into levels and associated with sets of healing actions, which are selected using association rules that model correlations between incident levels. We specifically study the long-tail effect and propose a new algorithm to control task replication. The healing process is parametrized on real application traces acquired in production on the European Grid Infrastructure. Experimental results obtained in the Virtual Imaging Platform show that the proposed method speeds up execution by up to a factor of 4, consumes up to 26% less resource time than a control execution, and properly detects unrecoverable errors.
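
The classification and association-rule steps described above can be illustrated with a minimal sketch. It is not the paper's implementation: the thresholds, the three-level scale, the incident names, and the sample history are all hypothetical, and the only rule statistic shown is the standard confidence measure, confidence(A -> B) = support(A and B) / support(A).

```python
from bisect import bisect_right
from collections import Counter
from itertools import permutations

# Hypothetical thresholds splitting a degree in [0, 1] into levels 1..3.
THRESHOLDS = [0.3, 0.7]


def level(degree: float) -> int:
    """Map an incident degree in [0, 1] to a discrete level (1 = lowest)."""
    return bisect_right(THRESHOLDS, degree) + 1


def rule_confidence(observations):
    """Compute confidence(A -> B) = support(A and B) / support(A)
    for every ordered pair of (incident, level) items observed together."""
    item_count = Counter()
    pair_count = Counter()
    for obs in observations:
        # Turn each measured degree into an (incident name, level) item.
        items = {(name, level(d)) for name, d in obs.items()}
        item_count.update(items)
        pair_count.update(permutations(items, 2))
    return {(a, b): n / item_count[a] for (a, b), n in pair_count.items()}


# Hypothetical history: one dict of incident degrees per past observation.
history = [
    {"long_tail": 0.9, "data_transfer": 0.8},
    {"long_tail": 0.8, "data_transfer": 0.2},
    {"long_tail": 0.1, "data_transfer": 0.1},
]
conf = rule_confidence(history)
```

In this toy history, a level-3 data transfer incident always co-occurs with a level-3 long-tail incident, so that rule has confidence 1.0, while the reverse rule has confidence 0.5. A healing process along these lines could use such confidences to decide which set of actions to try first when an incident level is detected.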
