An Integrated Approach to Parallel Scheduling Using Gang-Scheduling, Backfilling, and Migration

Effective scheduling strategies that improve response times, throughput, and utilization are an important consideration in large supercomputing environments. Parallel machines in these environments have traditionally used space-sharing strategies, which accommodate multiple jobs at the same time by dedicating each set of nodes to a single job until that job completes. This approach, however, can result in low system utilization and long job wait times. This paper discusses three techniques that go beyond simple space-sharing to improve the performance of large parallel systems: backfilling, gang-scheduling, and migration. The main contribution of this paper is an analysis of the effects of combining these techniques. Using extensive simulations based on detailed models of realistic workloads, we show the benefits of combining the techniques over a spectrum of performance criteria.
