Modelling Pilot-Job Applications on Production Grids

Pilot-job systems have emerged as a computation paradigm to cope with the heterogeneity of production grids, greatly improving fault ratios and latency. Tools like DIANE, WISDOM-II, ToPoS and Condor glideIns are now widely adopted to conduct large-scale experiments on such platforms. However, a model of pilot-job applications is still lacking, which makes it difficult to determine submission parameters such as the number of pilots to submit to reach a given performance level. The variability of production conditions and the heterogeneity of the underlying middleware and infrastructure further complicate this issue. This paper presents a performance model for pilot-job applications running on production grids. Using probabilistic modelling, we derive statistics about the number of available pilots over time and about the makespan of the application, given the number of submitted pilots. Results obtained on a radiotherapy application running on the EGEE production grid show that the model is accurate enough to correctly describe the behavior of the application, setting the basis for further optimization strategies.
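
To make the two quantities named above concrete, the following is a minimal sketch under simplifying assumptions of our own, and it is not the model derived in the paper. It assumes that each submitted pilot fails independently with probability `fault_ratio`, that surviving pilots register after independent latencies, and that all tasks have the same duration and are pulled by the first free pilot. All function and parameter names are hypothetical.

```python
import heapq
import random


def expected_available_pilots(n_pilots, fault_ratio, latency_cdf, t):
    """Expected number of registered pilots at time t.

    Assumption (ours): pilots are independent, each one fails with
    probability fault_ratio, and latency_cdf(t) is the probability that a
    non-faulty pilot has registered by time t.
    """
    return n_pilots * (1.0 - fault_ratio) * latency_cdf(t)


def simulate_makespan(n_pilots, n_tasks, task_time, fault_ratio,
                      sample_latency, rng):
    """One Monte Carlo draw of the makespan for identical tasks.

    Each surviving pilot becomes free at its registration time; every task
    is then given to the pilot that becomes free earliest.
    """
    # Registration times of the pilots that do not fail.
    free_at = [sample_latency(rng) for _ in range(n_pilots)
               if rng.random() >= fault_ratio]
    if not free_at:
        return float("inf")  # no pilot ever registers
    heapq.heapify(free_at)
    makespan = 0.0
    for _ in range(n_tasks):
        start = heapq.heappop(free_at)   # earliest free pilot
        end = start + task_time
        makespan = max(makespan, end)
        heapq.heappush(free_at, end)     # pilot is free again at `end`
    return makespan


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical setting: exponential latency with a mean of 300 s.
    draws = [simulate_makespan(50, 500, 120.0, 0.2,
                               lambda r: r.expovariate(1.0 / 300.0), rng)
             for _ in range(200)]
    print("mean simulated makespan (s):", sum(draws) / len(draws))
```

Repeating the simulation for several values of `n_pilots` gives a rough picture of how the makespan responds to the number of submitted pilots under these assumptions; the latency distribution and fault ratio would have to be estimated from actual grid traces.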
