Adaptive Metric-Aware Job Scheduling for Production Supercomputers

Job scheduling is a critical and complex task on large-scale supercomputers, where a scheduling policy is expected to fulfill amorphous and sometimes conflicting goals from both users and system owners. Moreover, the effectiveness of a scheduling policy depends on workload characteristics, which vary over time. It is therefore challenging to design a versatile scheduling policy that is effective in all circumstances. To address this issue, we propose an adaptive metric-aware job scheduling strategy. First, we propose metric-aware scheduling, which enables the scheduler to balance competing scheduling goals represented by different metrics such as job waiting time, fairness, and system utilization. Second, we enhance the scheduler to adaptively adjust its scheduling policies at runtime based on feedback from monitored metrics. We evaluate our design using real workloads from supercomputer centers and demonstrate that our scheduling mechanism can significantly improve system performance in a balanced, sustainable fashion.
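
The two ideas in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `Job` fields, the `balance` weight, the normalization, the target waiting time, and the step size are all assumptions chosen for illustration. The first part orders the queue by a score that blends a fairness-oriented term (time already waited) with an efficiency-oriented term (shorter estimated runtime); the second part nudges the blend at runtime based on a monitored metric.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    wait_time: float    # seconds spent in the queue so far (hypothetical field)
    est_runtime: float  # user-supplied runtime estimate in seconds (hypothetical field)


def score(job, balance, max_wait, max_runtime):
    """Blend two normalized terms; balance=1.0 favors long waiters, 0.0 favors short jobs."""
    wait_term = job.wait_time / max_wait if max_wait > 0 else 0.0
    short_term = 1.0 - job.est_runtime / max_runtime if max_runtime > 0 else 0.0
    return balance * wait_term + (1.0 - balance) * short_term


def order_queue(queue, balance):
    """Return the queued jobs sorted from highest to lowest score."""
    if not queue:
        return []
    max_wait = max(j.wait_time for j in queue)
    max_runtime = max(j.est_runtime for j in queue)
    return sorted(
        queue,
        key=lambda j: score(j, balance, max_wait, max_runtime),
        reverse=True,
    )


def adapt_balance(balance, monitored_avg_wait, target_avg_wait, step=0.25):
    """Feedback step (assumed rule): if the monitored average wait exceeds the
    target, shift weight toward short jobs; otherwise shift back toward fairness.
    The result is clamped to [0, 1]."""
    if monitored_avg_wait > target_avg_wait:
        balance -= step
    else:
        balance += step
    return min(1.0, max(0.0, balance))


if __name__ == "__main__":
    queue = [Job("a", wait_time=100, est_runtime=1000),
             Job("b", wait_time=10, est_runtime=10)]
    balance = 0.5
    # Pretend the monitor reports an average wait above the target.
    balance = adapt_balance(balance, monitored_avg_wait=200, target_avg_wait=100)
    print([j.name for j in order_queue(queue, balance)])
```

A single scalar weight keeps the sketch small; a production scheduler would likely track several metrics (for example utilization as well as waiting time) and tune more than one parameter.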
