Predicting bounds on queuing delay for batch-scheduled parallel machines

Most space-sharing parallel computers currently operated by high-performance computing (HPC) centers use batch-queuing systems to manage processor allocation. Users of these batch-queued resources often have accounts at multiple sites and can choose the site or sites at which to submit a parallel job. In such a situation, the time a job waits in any one batch queue can significantly affect the overall time from job submission to job completion. In this work, we explore a new method for providing end users with predictions of the bounds on the queuing delay that individual jobs will experience. We evaluate this method using batch scheduler logs for distributed-memory parallel machines that cover a 9-year period at 7 large HPC centers. Our results show that it is possible to predict delay bounds reliably for jobs in different queues and for jobs requesting different ranges of processor counts. Using this information, scientific application developers can intelligently decide where to submit their parallel codes in order to minimize overall turnaround time.
