Adaptive work-stealing with parallelism feedback

We present an adaptive work-stealing thread scheduler, A-Steal, for fork-join multithreaded jobs, like those written using the Cilk multithreaded language or the Hood work-stealing library. The A-Steal algorithm is appropriate for large parallel servers where many jobs share a common multiprocessor resource and in which the number of processors available to a particular job may vary during the job's execution. A-Steal provides continual parallelism feedback to a job scheduler in the form of processor requests, and the job must adapt its execution to the processors allotted to it. Assuming that the job scheduler never allots any job more processors than requested by the job's thread scheduler, A-Steal guarantees that the job completes in near-optimal time while utilizing at least a constant fraction of the allotted processors. Our analysis models the job scheduler as the thread scheduler's adversary, challenging the thread scheduler to be robust to the system environment and the job scheduler's administrative policies. We analyze the performance of A-Steal using "trim analysis," which allows us to prove that our thread scheduler performs poorly on at most a small number of time steps, while exhibiting near-optimal behavior on the vast majority. To be precise, suppose that a job has work T1 and span (critical-path length) T∞. On a machine with P processors, A-Steal completes the job in expected O(T1/P̃ + T∞ + L lg P) time steps, where L is the length of a scheduling quantum and P̃ denotes the O(T∞ + L lg P)-trimmed availability. This quantity is the average of the processor availability over all but the O(T∞ + L lg P) time steps having the highest processor availability. When the job's parallelism dominates the trimmed availability, that is, P̃ ≪ T1/T∞, the job achieves nearly perfect linear speedup. Conversely, when the trimmed availability dominates the parallelism, the asymptotic running time of the job is nearly its span.
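
To make the notion of trimmed availability concrete, the following is a minimal Python sketch that follows the verbal definition above: discard the R time steps with the highest processor availability and average the rest. The function name `trimmed_availability`, the list-of-integers input, and the choice to return 0.0 when every step is trimmed are illustrative assumptions, not details taken from the paper.

```python
def trimmed_availability(availability, r):
    """Mean processor availability after discarding the r highest steps.

    availability: processors available to the job at each time step
                  (hypothetical input format, one integer per step).
    r:            number of highest-availability steps to trim.
    """
    if r < 0:
        raise ValueError("r must be non-negative")
    # Sort ascending and keep all but the r largest values.
    kept = sorted(availability)[: max(len(availability) - r, 0)]
    if not kept:
        # Assumption: treat the fully trimmed case as zero availability.
        return 0.0
    return sum(kept) / len(kept)


# An adversarial allotment: many processors on only two time steps.
steps = [2, 2, 2, 64, 64, 2, 2, 2]
print(trimmed_availability(steps, 0))  # 17.5 (plain mean)
print(trimmed_availability(steps, 2))  # 2.0  (two peak steps trimmed)
```

The example suggests why trimming matters: a job scheduler that offers many processors on a few time steps inflates the plain mean, whereas the trimmed value reflects what the job sees on the vast majority of steps.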
