The Data Locality of Work Stealing

Abstract

This paper studies the data locality of the work-stealing scheduling algorithm on hardware-controlled shared-memory machines, where the movement of data to and from the cache is controlled solely by the hardware. We present lower and upper bounds on the number of cache misses incurred when using work stealing, and we introduce a locality-guided work-stealing algorithm together with its experimental validation.

As a lower bound, we show that a work-stealing computation that exhibits good data locality on a uniprocessor may exhibit poor data locality on a multiprocessor. In particular, we exhibit a family of multithreaded computations Gn whose members perform Θ(n) operations (work) and incur a constant number of cache misses on a uniprocessor, while even on two processors the total number of cache misses soars to Ω(n). On the other hand, we show a tight upper bound on the number of cache misses that nested-parallel computations, a large and important class of computations, incur due to multiprocessing. In particular, for nested-parallel computations we show that an execution on P processors incurs an expected O(C⌈m/s⌉PT∞) more misses than the uniprocessor execution, where m is the execution time of an instruction that incurs a cache miss, s is the steal time, C is the size of the cache, and T∞ is the number of nodes on the longest chain of dependencies. Based on this bound, we give strong execution-time bounds for nested-parallel computations using work stealing.

For the second part of our results, we present a locality-guided work-stealing algorithm that improves the data locality of multithreaded computations by allowing a thread to have an affinity for a processor. Our initial experiments on iterative data-parallel applications show that the algorithm matches the performance of static partitioning under traditional workloads, improves performance by up to 50% over static partitioning under multiprogrammed workloads, and improves the performance of plain work stealing by up to 80%.
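To make the scheduling ideas in the abstract concrete, the following is a minimal sketch of locality-guided work stealing as a sequential simulation: each worker owns a deque (owner works at the bottom, thieves steal from the top) plus a "mailbox" holding tasks that have affinity for that worker, which the worker drains before its deque. The mailbox mechanism, the class and function names, and the round-robin driver are all illustrative assumptions for this sketch, not the authors' implementation.

```python
import random
from collections import deque

class Worker:
    """One processor's scheduling state: a work deque plus a mailbox of
    tasks that have affinity for this worker (names are illustrative)."""
    def __init__(self, wid):
        self.wid = wid
        self.deque = deque()    # owner pushes/pops at the bottom (right end)
        self.mailbox = deque()  # tasks whose affinity is this worker

def spawn(workers, owner, task, affinity=None):
    """Owner pushes onto the bottom of its own deque; a task with an
    affinity is additionally recorded in that worker's mailbox."""
    workers[owner].deque.append(task)
    if affinity is not None:
        workers[affinity].mailbox.append(task)

def next_task(workers, wid, rng):
    """Locality-guided task selection: mailbox first (data likely cached
    here), then the local deque, then steal from a random victim's top."""
    me = workers[wid]
    while me.mailbox:               # prefer affinity tasks
        t = me.mailbox.popleft()
        if not t['done']:
            return t
    while me.deque:                 # then local work, bottom of deque
        t = me.deque.pop()
        if not t['done']:
            return t
    victims = [w for w in workers if w.wid != wid and w.deque]
    if victims:                     # finally, steal from the top
        victim = rng.choice(victims)
        while victim.deque:
            t = victim.deque.popleft()
            if not t['done']:
                return t
    return None

# Driver: 16 tasks, all spawned on worker 0, with affinities spread
# round-robin over 4 workers; workers take turns taking one task each.
rng = random.Random(0)
workers = [Worker(i) for i in range(4)]
tasks = [{'id': i, 'done': False, 'ran_on': None} for i in range(16)]
for t in tasks:
    spawn(workers, owner=0, task=t, affinity=t['id'] % 4)

step = 0
while any(not t['done'] for t in tasks):
    wid = step % 4
    t = next_task(workers, wid, rng)
    if t is not None:
        t['done'] = True            # "execute" the task
        t['ran_on'] = wid
    step += 1
```

Because a task can sit in both a mailbox and a deque, the `done` flag ensures it executes exactly once; in this deterministic driver every task is drained from its affinity mailbox before it can be stolen, illustrating how the mailbox biases execution toward the processor whose cache likely holds the task's data.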
