Optimizing jobs timeouts on clusters and production grids

This paper presents a method to optimize the timeout value of computing jobs. It relies on a model of job execution time that represents the latency of the job management system as a random variable. The model also includes a proportion of outliers, so that it covers both reliable clusters and production grids, where faults cause job losses. Job management systems are first studied using classical distributions; different behaviors appear depending on the weight of the distribution tail and on the proportion of outliers. Experimental results are then presented, based on the latency distribution and outlier ratios measured on the EGEE grid infrastructure. These results show that using the optimal timeout value provided by our method reduces the impact of outliers and yields a 1.36 speed-up even for reliable systems without outliers.
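The abstract does not state the model's equations, so the following is only an illustrative sketch under a standard restart assumption, not necessarily the formulation used in the paper. It assumes a job is cancelled and resubmitted whenever its latency exceeds the timeout, that successive submissions are independent, and that a fraction `outlier_ratio` of submissions never returns. Under these assumptions the expected total latency for a timeout t is the integral of 1 - (1 - rho) F(u) over [0, t], divided by (1 - rho) F(t), where F is the latency distribution of non-outlier jobs. All function names and the log-normal parameters are hypothetical.

```python
import math


def lognormal_cdf(x, mu, sigma):
    """CDF of a log-normal latency (hypothetical choice of heavy-tailed law)."""
    if x <= 0:
        return 0.0
    return 0.5 * (1.0 + math.erf((math.log(x) - mu) / (sigma * math.sqrt(2.0))))


def expected_latency(timeout, cdf, outlier_ratio, steps=2000):
    """Expected total latency when a job is resubmitted after `timeout`.

    Assumes independent resubmissions; `cdf` is the latency CDF of
    non-outlier jobs and `outlier_ratio` the fraction of lost jobs.
    """
    success = (1.0 - outlier_ratio) * cdf(timeout)
    if success <= 0.0:
        return math.inf  # no submission can ever finish before the timeout
    h = timeout / steps
    # Probability that a submission is still pending at time u.
    pending = [1.0 - (1.0 - outlier_ratio) * cdf(i * h) for i in range(steps + 1)]
    # Trapezoidal rule for the expected time spent in one attempt.
    per_attempt = h * (sum(pending) - 0.5 * (pending[0] + pending[-1]))
    # Expected number of attempts is 1 / success.
    return per_attempt / success


def best_timeout(cdf, outlier_ratio, candidates):
    """Grid search for the candidate timeout with the lowest expected latency."""
    return min(candidates, key=lambda t: expected_latency(t, cdf, outlier_ratio))


if __name__ == "__main__":
    # Hypothetical parameters, not the values measured on EGEE.
    cdf = lambda x: lognormal_cdf(x, mu=5.5, sigma=1.0)
    grid = list(range(50, 5001, 50))
    t_opt = best_timeout(cdf, outlier_ratio=0.05, candidates=grid)
    print(t_opt, expected_latency(t_opt, cdf, 0.05))
```

A grid search is used here only for simplicity; any one-dimensional minimizer would serve. With a memoryless (exponential) latency and no outliers, the expected latency in this sketch does not depend on the timeout, which is consistent with the abstract's remark that the behavior depends on the weight of the tail and on the amount of outliers.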
