ESTIMATING THE EXECUTION CONTEXT FOR REFINING SUBMISSION STRATEGIES ON PRODUCTION GRIDS

In this paper, we study grid job submission latencies. Latency strongly affects performance on production grids because of its high values, its variability, and the presence of outliers. It makes it particularly difficult to determine the status and expected duration of jobs. Previous work presented a probabilistic model of the latency that makes it possible to estimate the best timeout value for a given distribution of job latencies. This timeout value is then used in a job resubmission strategy. The purpose of this paper is to evaluate to what extent updating this model with relevant contextual parameters can help refine the latency estimation. Experiments on the EGEE grid show that the choice of the resource broker or the computing site has a statistically significant influence on job latency. We exploit this contextual information to propose a reliable job submission strategy.
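As an illustration of how a timeout could be derived from a latency distribution, here is a minimal Python sketch. It is not the model from the paper: it assumes a simple strategy in which a job is cancelled and resubmitted whenever it exceeds the timeout, and that successive submissions have independent latencies. Under that assumption the expected total latency for a timeout t is E[min(R, t)] / P(R <= t), where R is the latency of a single submission. The function names, the sample values, and the broker labels are all hypothetical; outliers (jobs that never return) are represented as infinite latencies.

```python
import math


def expected_latency(latencies, timeout):
    """Expected total latency when resubmitting after `timeout` seconds.

    Assumes independent submissions drawn from the empirical sample
    `latencies`; math.inf marks an outlier that never returned.
    """
    n = len(latencies)
    # Empirical probability that a single submission finishes in time.
    success = sum(1 for x in latencies if x <= timeout) / n
    if success == 0:
        return math.inf
    # Empirical mean of min(R, timeout): time spent per attempt.
    per_attempt = sum(min(x, timeout) for x in latencies) / n
    return per_attempt / success


def best_timeout(latencies):
    """Return the observed finite latency that minimises expected_latency."""
    candidates = sorted({x for x in latencies if math.isfinite(x)})
    if not candidates:
        raise ValueError("no finite latency in the sample")
    return min(candidates, key=lambda t: expected_latency(latencies, t))


# Hypothetical samples (seconds), grouped by a contextual parameter.
samples_by_broker = {
    "broker_a": [10.0, 10.0, 10.0, 1000.0],
    "broker_b": [100.0, 100.0, 100.0, math.inf],
}
timeouts = {name: best_timeout(s) for name, s in samples_by_broker.items()}
```

Computing one timeout per group, as in the last lines, is one simple way to make the estimate depend on the execution context, for instance on the resource broker or the computing site.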
