On the user-scheduler relationship in high-performance computing

Effective management of High-Performance Computing (HPC) resources requires maximizing the return on the substantial infrastructure investment they entail. One prerequisite to success is the ability of the scheduler and user to interact productively. This work develops criteria for measuring productivity, analyzes several aspects of the user-scheduler relationship via user studies, and develops solutions to some vexing barriers between users and schedulers. The five main contributions of this work are as follows. First, this work quantifies the desires of the user population and represents them as a utility function. This contribution comprises a survey-based study collecting utility data from users of a supercomputer system, an augmentation of the Standard Workload Format to enable scheduler research using utility functions, and a model for synthetically generating utility function-augmented workloads. Second, several classic scheduling disciplines are evaluated by their ability to maximize the aggregate utility of all users, using the synthetic utility functions. These evaluations show the performance impact of inaccurate runtime estimates, contradicting an oft-quoted prior result [55] that inaccuracy of estimates leads to better scheduling. Third, a scheduler that optimizes the aggregate utility of all users, using a genetic algorithm heuristic, is demonstrated. This contribution includes two software artifacts: an implementation of the genetic algorithm (GA) scheduler, and a modular, extensible scheduler simulation framework that simulates several classic scheduling disciplines and is interoperable with the Standard Workload Format. Fourth, the ability of users to interact productively with this scheduler by providing an accurate estimate of their resource (run time) needs is examined. This contribution consists of formalizing a frequent casual assertion from the scheduling literature, that users typically "pad" runtime estimates, into an explicit Padding Hypothesis, and then falsifying the hypothesis via a survey-based study of users of a supercomputer system. Specifically, absent an incentive to pad, and even with incentives to be accurate, the inaccuracy of runtime estimates improved only from an average of 61% inaccurate to an average of 57% inaccurate. This contribution has implications not only for the proposed genetic algorithm scheduler, but for any scheduler that asks users for an estimate, which currently includes virtually all parallel job schedulers both in production use and proposed in the literature. Fifth, a survey of users of a supercomputer system and associated simulations explore the feasibility of removing one of the defining constraints of the parallel job scheduling problem: the non-preemptability of running jobs. An investigation of users' current checkpointing habits produced a workload labeled with per-job checkpoint information, enabling simulation of a checkpoint-aware GA scheduler that may preempt running jobs as it optimizes aggregate utility. Lifting the non-preemptability constraint improves the performance of the GA scheduler by 16% (and by 23% compared to the classic EASY algorithm), even after accounting for overhead penalties for job termination and restart.
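To make the notion of aggregate utility concrete, the following is a minimal sketch and not the scheduler or simulation framework described above. It assumes a single resource that runs one job at a time and a hypothetical linear-decay utility function; the names `Job`, `utility`, `aggregate_utility`, `max_value`, and `decay` are illustrative assumptions, and the utility functions collected in the survey may have a different shape. With only two jobs, brute-force enumeration of orderings stands in for the genetic algorithm heuristic.

```python
from dataclasses import dataclass
from itertools import permutations
from typing import Sequence


@dataclass(frozen=True)
class Job:
    name: str
    submit: float     # submission time, in hours
    runtime: float    # actual run time, in hours
    max_value: float  # hypothetical: value of an instantaneous result
    decay: float      # hypothetical: value lost per hour of turnaround


def utility(job: Job, finish: float) -> float:
    """Assumed linear-decay utility of finishing `job` at time `finish`."""
    turnaround = finish - job.submit
    return max(0.0, job.max_value - job.decay * turnaround)


def aggregate_utility(order: Sequence[Job]) -> float:
    """Sum of user utilities when jobs run back to back in `order`."""
    clock = 0.0
    total = 0.0
    for job in order:
        start = max(clock, job.submit)  # a job cannot start before submission
        clock = start + job.runtime
        total += utility(job, clock)
    return total


long_job = Job("long", submit=0.0, runtime=4.0, max_value=10.0, decay=1.0)
urgent_job = Job("urgent", submit=0.0, runtime=1.0, max_value=10.0, decay=5.0)

# First-come-first-served order versus the utility-maximizing order.
fcfs = aggregate_utility([long_job, urgent_job])
best_order = max(permutations([long_job, urgent_job]), key=aggregate_utility)
```

In this toy case, running the urgent job first raises aggregate utility from 6.0 to 10.0, because the urgent job's value decays quickly while it waits. Exhaustive enumeration grows factorially with the number of jobs, which is one reason a heuristic search such as a genetic algorithm is attractive for realistic workloads.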

[1] Larry Rudolph, et al. Towards Convergence in Job Schedulers for Parallel Supercomputers, 1996, JSSPP.

[2] Richard Wolski, et al. QBETS: queue bounds estimation from time series, 2007, SIGMETRICS '07.

[3] Francine Berman, et al. A model for moldable supercomputer jobs, 2001, Proceedings 15th International Parallel and Distributed Processing Symposium. IPDPS 2001.

[4] Dror G. Feitelson, et al. Utilization, Predictability, Workloads, and User Runtime Estimates in Scheduling the IBM SP2 with Backfilling, 2001, IEEE Trans. Parallel Distributed Syst.

[5] P. Sadayappan, et al. Characterization of backfilling strategies for parallel job scheduling, 2002, Proceedings. International Conference on Parallel Processing Workshop.

[6] Dean M. Tullsen, et al. Symbiotic jobscheduling for a simultaneous multithreading processor, 2000, SIGP.

[7] R. Buehler, et al. Planning, personality, and prediction: The role of future focus in optimistic time predictions, 2003.

[8] Adi Raveh, et al. Comparing Logs and Models of Parallel Workloads Using the Co-plot Method, 1999, JSSPP.

[9] Cynthia Bailey Lee, et al. Precise and realistic utility functions for user-centric performance analysis of schedulers, 2007, HPDC '07.

[10] Erich Strohmaier, et al. Supercomputing: What have we learned from the TOP500 Project?, 2004.

[11] Joseph Y.-T. Leung, et al. Complexity of Scheduling Parallel Task Systems, 1989, SIAM J. Discret. Math.

[12] Amin Vahdat, et al. Addressing strategic behavior in a deployed microeconomic resource allocator, 2005, P2PECON '05.

[13] Robert F. Lucas, et al. Building the Teraflops/Petabytes Production Supercomputing Center, 1999, Euro-Par.

[14] Robert L. Henderson, et al. Job Scheduling Under the Portable Batch System, 1995, JSSPP.

[15] Richard Wolski, et al. Predicting bounds on queuing delay for batch-scheduled parallel machines, 2006, PPoPP '06.

[16] Amin Vahdat, et al. Evaluating the impact of inaccurate information in utility-based scheduling, 2009, Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis.

[17] Chaki Ng, et al. Mirage: a microeconomic resource allocation system for sensornet testbeds, 2005, The Second IEEE Workshop on Embedded Networked Sensors, 2005. EmNetS-II.

[18] David S. Johnson, et al. Computers and Intractability: A Guide to the Theory of NP-Completeness, 1978.

[19] Dan Tsafrir, et al. The Dynamics of Backfilling: Solving the Mystery of Why Increased Inaccuracy May Help, 2006, 2006 IEEE International Symposium on Workload Characterization.

[20] Kai Li, et al. Libckpt: Transparent Checkpointing under UNIX, 1995, USENIX.

[21] A. Rollett, et al. The Monte Carlo Method, 2004.

[22] Bent Flyvbjerg, et al. Delusions of Success, 2003.

[23] Dan Tsafrir, et al. Modeling User Runtime Estimates, 2005, JSSPP.

[24] Francine Berman, et al. A comprehensive model of the supercomputer workload, 2001.

[25] Mark J. Clement, et al. Core Algorithms of the Maui Scheduler, 2001, JSSPP.

[26] Phillip Krueger, et al. Job Scheduling is More Important than Processor Allocation for Hypercube Computers, 1994, IEEE Trans. Parallel Distributed Syst.

[27] Uwe Schwiegelshohn, et al. Parallel Job Scheduling - A Status Report, 2004, JSSPP.

[28] John E. West, et al. Scheduling Jobs on Parallel Systems Using a Relaxed Backfill Strategy, 2002, JSSPP.

[29] Andrea C. Arpaci-Dusseau, et al. The Impact of More Accurate Requested Runtimes on Production Job Scheduling Performance, 2002, JSSPP.

[30] Cynthia Bailey Lee, et al. On the User–Scheduler Dialogue: Studies of User-Provided Runtime Estimates and Utility Functions, 2006, Int. J. High Perform. Comput. Appl.

[31] James Patton Jones, et al. Scheduling for Parallel Supercomputing: A Historical Perspective of Achievable Utilization, 1999, JSSPP.

[32] Evgenia Smirni, et al. Multiple-queue backfilling scheduling with priorities and reservations for parallel systems, 2002, PERV.

[33] Jack Dongarra, et al. MPI: The Complete Reference, 1996.

[34] F. B. Berlin. Seymour Cray, 1925-1996 [In Memoriam], 1996.

[35] Dmitry N. Zotkin, et al. Job-length estimation and performance in backfilling schedulers, 1999, Proceedings. The Eighth International Symposium on High Performance Distributed Computing (Cat. No.99TH8469).

[36] Jack J. Dongarra, et al. The LINPACK Benchmark: past, present and future, 2003, Concurr. Comput. Pract. Exp.

[37] Warren Smith, et al. Benchmarks and Standards for the Evaluation of Parallel Job Schedulers, 1999, JSSPP.

[38] Phil Andrews, et al. Co-scheduling with User-Settable Reservations, 2005, JSSPP.

[39] Francine Berman, et al. Adaptive Selection of Partition Size for Supercomputer Requests, 2000, JSSPP.

[40] Larry Rudolph, et al. Parallel Job Scheduling: Issues and Approaches, 1995, JSSPP.

[41] Uwe Schwiegelshohn, et al. Theory and Practice in Parallel Job Scheduling, 1997, JSSPP.

[42] Guido Van Rossum, et al. Python Tutorial, 1999.

[43] Richard Wolski, et al. Probabilistic advanced reservations for batch-scheduled parallel machines, 2008, PPoPP.

[44] Dan Tsafrir, et al. Session-Based, Estimation-less, and Information-less Runtime Prediction Algorithms for Parallel and Grid Job Scheduling, 2006.

[45] Tad Hogg, et al. Spawn: A Distributed Computational Economy, 1992, IEEE Trans. Software Eng.

[46] Jason Duell, et al. The LAM/MPI Checkpoint/Restart Framework: System-Initiated Checkpointing, 2005, Int. J. High Perform. Comput. Appl.

[47] Allan Snavely, et al. Symbiotic Space-Sharing on SDSC's DataStar System, 2006, JSSPP.

[48] Allan Snavely, et al. When Jobs Play Nice: The Case For Symbiotic Space-Sharing, 2006, 2006 15th IEEE International Conference on High Performance Distributed Computing.

[49] R. Buyya, et al. Market-Oriented Grid and Utility Computing, 2009.

[50] Richard Wolski, et al. G-commerce: market formulations controlling resource allocation on the computational grid, 2001, Proceedings 15th International Parallel and Distributed Processing Symposium. IPDPS 2001.

[51] Alvin AuYoung, et al. Service contracts and aggregate utility functions, 2006, 2006 15th IEEE International Conference on High Performance Distributed Computing.

[52] David E. Culler, et al. Market-based cluster resource management, 2001.

[53] Stephen A. Jarvis, et al. Hybrid Performance-Oriented Scheduling of Moldable Jobs with QoS Demands in Multiclusters and Grids, 2004, GCC.

[54] Berit Johannes, et al. Scheduling parallel jobs to minimize the makespan, 2006, J. Sched.

[55] Cynthia Bailey Lee, et al. Are User Runtime Estimates Inherently Inaccurate?, 2004, JSSPP.

[56] David E. Irwin, et al. Balancing risk and reward in a market-based task service, 2004, Proceedings. 13th IEEE International Symposium on High performance Distributed Computing, 2004.

[57] Dror G. Feitelson, et al. Pitfalls in Parallel Job Scheduling Evaluation, 2005, JSSPP.

[58] Michael P. Rogers. Python Tutorial, 2009.

[59] Li Zhang, et al. Tycoon: An implementation of a distributed, market-based resource allocation system, 2004, Multiagent Grid Syst.

[60] Hussein M. Abdel-Wahab, et al. A Microeconomic Scheduler for Parallel Computers, 1995, JSSPP.

[61] Meeta Sharma Gupta, et al. Performance implications of periodic checkpointing on large-scale cluster systems, 2005, 19th IEEE International Parallel and Distributed Processing Symposium.

[62] David E. Culler, et al. User-Centric Performance Analysis of Market-Based Cluster Batch Schedulers, 2002, 2nd IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGRID'02).

[63] Charng-Da Lu, et al. Reliability challenges in large systems, 2006, Future Gener. Comput. Syst.

[64] David Abramson, et al. Economic models for resource management and scheduling in Grid computing, 2002, Concurr. Comput. Pract. Exp.

[65] P. Sadayappan, et al. Robust scheduling of moldable parallel jobs, 2004, Int. J. High Perform. Comput. Netw.

[66] Robert B. Wilson, et al. Architecture of Power Markets, 2022, Research Paper Series, Graduate School of Business, Stanford University.

[67] Honbo Zhou, et al. The EASY - LoadLeveler API Project, 1996, JSSPP.

[68] Evgenia Smirni, et al. Multiple-Queue Backfilling Scheduling with Priorities and Reservations for Parallel Systems, 2002, JSSPP.

[69] Dror G. Feitelson. On the Interpretation of Top500 Data, 1999, Int. J. High Perform. Comput. Appl.

[70] IBM Redbooks, et al. Workload Management with LoadLeveler, 2001.

[71] E. N. Elnozahy, et al. Checkpointing for peta-scale systems: a look into the future of practical rollback-recovery, 2004, IEEE Transactions on Dependable and Secure Computing.

[72] Peter J. Keleher, et al. Randomization, Speculation, and Adaptation in Batch Schedulers, 2000, ACM/IEEE SC 2000 Conference (SC'00).

[73] Francine Berman, et al. Using Moldability to Improve the Performance of Supercomputer Jobs, 2002, J. Parallel Distributed Comput.

[74] Hans Werner Meuer. The TOP500 Project: Looking Back Over 15 Years of Supercomputing Experience, 2008, Informatik-Spektrum.

[75] David H. Bailey. A high-performance fast Fourier transform algorithm for the Cray-2, 2004, The Journal of Supercomputing.

[76] N. Fenton. The Personal Interview, 1934.

[77] Richard Wolski, et al. Eliciting honest value information in a batch-queue environment, 2007, 2007 8th IEEE/ACM International Conference on Grid Computing.

[78] Paul Messina. The concurrent supercomputing consortium: Year 1, 1993, IEEE Parallel & Distributed Technology: Systems & Applications.

[79] Gordon Bell, et al. What's next in high-performance computing?, 2002, CACM.

[80] Dean M. Tullsen, et al. Symbiotic jobscheduling for a simultaneous multithreading processor, 2000.

[81] Jack Dongarra, et al. Biannual Top-500 Computer Lists Track Changing Environments for Scientific Computing from the First Vector Machines to Today's Cluster-based Systems, 2001.

[82] Yoav Shoham, et al. Truth revelation in approximately efficient combinatorial auctions, 2002, EC '99.

[83] P. Taylor. The San Diego Supercomputer Center, 1994, IEEE Computational Science and Engineering.

[84] Peter Norvig, et al. Artificial Intelligence: A Modern Approach, 1995.

[85] Laura Carrington, et al. Applying an Automated Framework to Produce Accurate Blind Performance Predictions of Full-Scale HPC Applications, 2004.

[86] Allan Snavely, et al. User-guided symbiotic space-sharing of real workloads, 2006, ICS '06.

[87] N. Metropolis, et al. The Monte Carlo method, 1949.