A model of pilot-job resource provisioning on production grids

Pilot-job systems emerged as a computation paradigm to cope with the heterogeneity of large-scale production grids, greatly reducing fault ratios and middleware overheads. They are now widely adopted to sustain the computation of scientific applications on such platforms. However, a model of pilot-job systems is still lacking, which makes it difficult to build realistic experimental setups for their study (e.g. simulators or controlled platforms). The variability of production conditions, background loads and resource characteristics further complicates this issue. This paper presents a model of pilot-job resource provisioning. Based on a probabilistic model of pilot submission and registration, the number of pilots registered to the application host and the makespan of a divisible-load application are derived. The model accounts for job failures and makes no assumption about the characteristics of the computing resources, the scheduling algorithm or the background load. Only minimally invasive monitoring of the grid is required. The model is evaluated in production conditions, using logs acquired on a pilot-job server deployed in the biomed virtual organization of the European Grid Infrastructure. Experimental results show that the model accurately describes the number of registered pilots over time periods ranging from a few hours to a few days and under different pilot submission conditions.
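
To make the two derived quantities concrete, the following is a minimal illustrative sketch and not the model of the paper. It assumes that each pilot fails independently with probability `failure_prob`, that the latency between submission and registration is exponentially distributed with mean `mean_latency`, and that every registered pilot processes one unit of load per unit of time. All names, the exponential distribution and the unit-speed assumption are hypothetical choices made for illustration only; the paper itself makes no assumption on resource characteristics.

```python
import math


def expected_registered(t, submit_times, failure_prob, mean_latency):
    """Expected number of pilots registered at time t.

    Each pilot submitted at time s contributes
    (1 - failure_prob) * F(t - s), where F is the latency CDF.
    The exponential CDF used here is an illustrative assumption.
    """
    total = 0.0
    for s in submit_times:
        if t > s:
            cdf = 1.0 - math.exp(-(t - s) / mean_latency)
            total += (1.0 - failure_prob) * cdf
    return total


def makespan(work, submit_times, failure_prob, mean_latency,
             dt=1.0, horizon=1e6):
    """Approximate time at which the accumulated pilot-time covers `work`.

    Integrates the expected number of registered pilots with a left
    Riemann sum of step dt, assuming unit-speed pilots (hypothetical).
    Returns math.inf if the load is not completed before `horizon`.
    """
    done, t = 0.0, 0.0
    while t < horizon:
        done += expected_registered(
            t, submit_times, failure_prob, mean_latency) * dt
        t += dt
        if done >= work:
            return t
    return math.inf
```

For instance, `makespan(100.0, [0.0] * 10, 0.0, 1.0, dt=0.1)` estimates the completion time of 100 units of load with ten pilots submitted at time 0; under these assumptions it cannot be lower than 10 time units, since at most ten pilots are ever registered.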
