Discovering Job Preemptions in the Open Science Grid

The Open Science Grid (OSG) [9] is a worldwide computing system that facilitates distributed computing for scientific research. It can distribute a computationally intensive job across geo-distributed clusters and process the job's tasks in parallel. On OSG compute clusters, physical resources may be shared between OSG jobs and jobs submitted by the cluster's local users, with local jobs preempting OSG-based ones. As a result, job preemptions occur frequently in the OSG and sometimes significantly delay job completion. We collected job data from the OSG over a period of more than 80 days. We present an analysis of this data that characterizes the preemption patterns and the different types of jobs. Based on our observations, we group OSG jobs into 5 categories and analyze the runtime statistics of each category. We further choose different statistical distributions to estimate the probability density function of job runtime for each category.
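As a rough illustration of the last step, estimating a runtime density by fitting a parametric distribution, the sketch below fits a log-normal distribution by maximum likelihood. The abstract does not say which distributions were chosen, so the log-normal here is an assumption made only for illustration. The function names `fit_lognormal` and `lognormal_pdf` are hypothetical, and the runtimes are synthetic, not OSG data.

```python
import math
import random


def fit_lognormal(runtimes):
    """Maximum-likelihood log-normal fit: returns (mu, sigma) of log(runtime)."""
    logs = [math.log(t) for t in runtimes]
    mu = sum(logs) / len(logs)
    # The MLE of sigma uses the 1/n (not 1/(n-1)) variance of the log-runtimes.
    sigma = math.sqrt(sum((v - mu) ** 2 for v in logs) / len(logs))
    return mu, sigma


def lognormal_pdf(t, mu, sigma):
    """Density of a log-normal with parameters (mu, sigma) at runtime t > 0."""
    z = (math.log(t) - mu) / sigma
    return math.exp(-0.5 * z * z) / (t * sigma * math.sqrt(2.0 * math.pi))


# Synthetic runtimes in seconds for one hypothetical job category.
rng = random.Random(0)
sample = [rng.lognormvariate(7.0, 1.2) for _ in range(5000)]

mu_hat, sigma_hat = fit_lognormal(sample)
print(f"mu={mu_hat:.3f} sigma={sigma_hat:.3f}")
print(f"estimated density at 1000 s: {lognormal_pdf(1000.0, mu_hat, sigma_hat):.6f}")
```

In practice one would repeat the fit per job category and compare candidate distributions with a goodness-of-fit measure before settling on one.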

[1] Paul Avery, et al. The Open Science Grid, 2007.

[2] Amit Choudhury, et al. A Simple Derivation of Moments of the Exponentiated Weibull Distribution, 2005.

[3] Andrei Tsaregorodtsev, et al. DIRAC pilot framework and the DIRAC Workload Management System, 2010.

[4] Tadashi Maeno, et al. The ATLAS PanDA Pilot in Operation, 2011.

[5] Rajesh Raman, et al. Matchmaking: distributed resource management for high throughput computing, 1998, Proceedings. The Seventh International Symposium on High Performance Distributed Computing.

[6] I. Sfiligoi. Estimating job runtime for CMS analysis jobs, 2014.

[7] P. Buncic, et al. AliEn—ALICE environment on the GRID, 2003.

[8] Todd Gamblin, et al. Machine Learning Predictions of Runtime and IO Traffic on High-End Clusters, 2016, 2016 IEEE International Conference on Cluster Computing (CLUSTER).

[9] Miron Livny, et al. Faults in Large Distributed Systems and What We Can Do About Them, 2005, Euro-Par.

[10] Eduardo Huedo, et al. Evaluating the reliability of computational grids from the end user's point of view, 2006, J. Syst. Archit.

[11] Igor Sfiligoi, et al. glideinWMS - A generic pilot-based Workload Management System, 2008.

[12] Douglas Thain, et al. Distributed computing in practice: the Condor experience, 2005, Concurr. Pract. Exp.

[13] Alexandru Iosup, et al. Trace-based evaluation of job runtime and queue wait time predictions in grids, 2009, HPDC '09.

[14] Dan Tsafrir, et al. Backfilling Using System-Generated Predictions Rather than User Runtime Estimates, 2007, IEEE Transactions on Parallel and Distributed Systems.

[15] Mats Rynge, et al. The OSG open facility: A sharing ecosystem, 2015.