Paper Information - Modelling Pilot-Job Applications on Production Grids

Pilot-job systems have emerged as a computation paradigm to cope with the heterogeneity of production grids, greatly improving fault ratios and latency. Tools like DIANE, WISDOM-II, ToPoS and Condor glideIns are now being widely adopted to conduct large-scale experiments on such platforms. However, a model of pilot-job applications is still lacking, making it difficult to determine submission parameters such as the number of pilots to submit to achieve a given performance level. The variability of production conditions and the heterogeneity of the underlying middleware and infrastructure further complicate this issue. This paper presents a performance model for pilot-job applications running on production grids. Based on probabilistic modelling, we derive statistics about the number of available pilots over time and the makespan of the application given the number of submitted pilots. Results obtained on a radiotherapy application running on the EGEE production grid show that the model is accurate enough to correctly describe the behavior of the application, setting the basis for further optimization strategies.

Tristan Glatard | Sorina Camarasu-Pop
