On Simulation and Design of Parallel-Systems Schedulers: Are We Doing the Right Thing?

Parallel-system schedulers are customarily evaluated with open-system, trace-driven simulations. As a consequence, schedulers have evolved to optimize the packing of jobs in the schedule as a means of improving a number of performance metrics that are conjectured to correlate with user satisfaction, on the premise that this will translate into higher productivity in practice. We argue that these simulations suffer from severe limitations that lead to suboptimal scheduler designs, and even to the dismissal of potentially good design alternatives. We propose an alternative methodology called site-level simulation, in which the workload for the evaluation is generated dynamically by user models that interact with the system. We present a novel scheduler called CREASY that exploits knowledge of user behavior to improve user satisfaction directly, and we compare its performance to that of the original packing-based EASY scheduler. We show that user productivity improves by up to 50 percent under the user-aware design, even though performance may actually degrade according to the conventional metrics.
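
The contrast between replaying a fixed trace and generating the workload through feedback can be illustrated with a toy sketch. It is not the paper's user model, nor EASY or CREASY: it assumes a single first-come-first-served server and users who submit their next job a fixed think time after the previous one completes. The names `site_level_sim`, `think_time`, and `speed` are hypothetical and introduced only for this illustration.

```python
import heapq

def site_level_sim(user_jobs, think_time, speed=1.0):
    """Toy feedback-driven simulation (illustrative assumption, not the paper's model).

    user_jobs  : one list of job runtimes per user
    think_time : delay between a job's completion and the same user's next submission
    speed      : relative system speed; a larger value shortens every runtime
    Returns a list of (user, submit_time, finish_time) in service order.
    """
    # Every user submits a first job at time 0; later submissions depend on feedback.
    pending = [(0.0, u, 0) for u, jobs in enumerate(user_jobs) if jobs]
    heapq.heapify(pending)
    free_at = 0.0  # time at which the single server becomes idle
    log = []
    while pending:
        submit, user, idx = heapq.heappop(pending)  # earliest submission first (FCFS)
        start = max(submit, free_at)
        finish = start + user_jobs[user][idx] / speed
        free_at = finish
        log.append((user, submit, finish))
        if idx + 1 < len(user_jobs[user]):
            # Feedback: the next submission time depends on when this job finished.
            heapq.heappush(pending, (finish + think_time, user, idx + 1))
    return log

jobs = [[10.0, 10.0], [10.0, 10.0]]  # two users, two jobs each
slow = site_level_sim(jobs, think_time=5.0, speed=1.0)
fast = site_level_sim(jobs, think_time=5.0, speed=2.0)
print([submit for _, submit, _ in slow])
print([submit for _, submit, _ in fast])
```

In this sketch the submission times themselves shift when the system gets faster, whereas an open-system trace replay would keep the arrival times fixed regardless of how well the scheduler performs.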
