Self-Healing of Operational Workflow Incidents on Distributed Computing Infrastructures

Distributed computing infrastructures are commonly used through scientific gateways, but operating these gateways requires significant human intervention to handle operational incidents. This paper presents a self-healing process that quantifies incident degrees of workflow activities from metrics measuring the long-tail effect, application efficiency, data transfer issues, and site-specific problems. These metrics are simple enough to be computed online, and they make few assumptions about the application or resource characteristics. Incidents are classified into levels and associated with sets of healing actions, which are selected based on association rules modeling correlations between incident levels. The healing process is parametrized on real application traces acquired in production on the European Grid Infrastructure. Implementation and experimental results obtained in the Virtual Imaging Platform show that the proposed method speeds up execution by up to a factor of 4 and properly detects unrecoverable errors.
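The overall loop described above (compute an incident degree, map it to a level, then pick healing actions using association rules between incident levels) can be sketched roughly as follows. This is only an illustrative sketch, not the paper's implementation: the degree formula, the thresholds, the rule table, and the action names (`replicate_task`, `replicate_input_files`) are all assumptions made for the example.

```python
from bisect import bisect_right
from statistics import median

# Assumed thresholds: degree < 0.4 -> level 1, < 0.7 -> level 2, else level 3.
THRESHOLDS = [0.4, 0.7]

# Assumed healing actions per incident type (placeholders, not the paper's).
ACTIONS = {
    "long_tail": ["replicate_task"],
    "data_transfer": ["replicate_input_files"],
}

# Assumed association rules:
# (cause, cause_level, effect, effect_level) -> confidence.
RULES = {
    ("data_transfer", 2, "long_tail", 2): 0.8,
}


def long_tail_degree(elapsed, completed_durations):
    """Illustrative degree in [0, 1): grows as a running task exceeds the median."""
    if not completed_durations or elapsed <= 0:
        return 0.0
    return max(0.0, 1.0 - median(completed_durations) / elapsed)


def incident_level(degree, thresholds=THRESHOLDS):
    """Map a degree to a discrete level, starting at 1 (no incident)."""
    return bisect_right(thresholds, degree) + 1


def select_actions(levels, rules=RULES, actions=ACTIONS, min_conf=0.5):
    """Pick actions for active incidents, preferring a correlated cause
    when a sufficiently confident rule applies."""
    selected = []
    for incident, level in levels.items():
        if level == 1:  # level 1 means no incident for this metric
            continue
        target, best = incident, min_conf
        for (cause, c_lvl, effect, e_lvl), conf in rules.items():
            matches = (effect == incident and e_lvl == level
                       and levels.get(cause) == c_lvl)
            if matches and conf >= best:
                target, best = cause, conf
        for action in actions.get(target, []):
            if action not in selected:
                selected.append(action)
    return selected


# A task running for 300 s while completed tasks took about 100 s.
level = incident_level(long_tail_degree(300, [90, 100, 110]))
print(select_actions({"long_tail": level, "data_transfer": 2}))
```

In this sketch, a long-tail incident that co-occurs with a data transfer incident is healed by acting on the data transfer side, since the assumed rule points to it as the likely cause; without that co-occurrence, the long-tail action is used directly.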
