Optimizing Work Stealing Communication with Structured Atomic Operations

Applications that rely on sparse or irregular data are often challenging to scale on modern distributed-memory systems. As a result, these applications typically require continuous load balancing to maintain efficiency, and work stealing is a common technique for remedying imbalance. In this work we present a work-stealing strategy that halves the communication required for a steal operation. Conventionally, work stealing is a two-step process: a thief first discovers work and then claims it. We show that, in exchange for a small amount of additional complexity in managing the local queue state, both steps can be combined into one. Our system, SWS, performs discovery and claiming in a single communication, without multiple synchronization messages. This reduction is made possible by a novel application of atomic operations that manipulate a compact representation of task queue metadata. We demonstrate the effectiveness of this strategy using established benchmarks for dynamic load balancing systems and for unbalanced tree search. Our results show that the reduction in communication lowers task acquisition time and steal time, which in turn improves overall performance on sparse computations.
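To make the single-step idea concrete, the following is a minimal sketch of how one atomic fetch-and-add on a packed metadata word could both discover and claim work. It is an illustration under stated assumptions, not the SWS implementation: the abstract does not give the metadata layout, so the 32/32-bit split, the steal-half schedule, and the names `AtomicWord`, `pack`, `claim`, and `steal` are all hypothetical. A lock-protected Python object stands in for the remote atomic operation that a one-sided communication layer would provide.

```python
import threading

COUNT_BITS = 32                      # assumed layout: low 32 bits = task count
COUNT_MASK = (1 << COUNT_BITS) - 1   # high 32 bits = number of steals so far


class AtomicWord:
    """Stand-in for a 64-bit word targeted by a remote atomic fetch-and-add."""

    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()

    def fetch_add(self, delta):
        # Atomically add delta and return the previous value.
        with self._lock:
            old = self._value
            self._value = (old + delta) & ((1 << 64) - 1)
            return old


def pack(steals, ntasks):
    """Pack the steal counter and the shared task count into one word."""
    return (steals << COUNT_BITS) | ntasks


def claim(old_word):
    """Derive (offset, size) of the claimed block from the pre-increment word.

    Assumes a steal-half schedule: each successive thief takes the ceiling of
    half of whatever the previous thieves left behind.
    """
    steals = old_word >> COUNT_BITS
    remaining = old_word & COUNT_MASK
    offset = 0
    for _ in range(steals):
        taken = (remaining + 1) // 2
        offset += taken
        remaining -= taken
    size = (remaining + 1) // 2      # 0 means the steal found no work
    return offset, size


def steal(meta):
    """One communication: bump the steal counter and learn what was claimed."""
    old = meta.fetch_add(1 << COUNT_BITS)
    return claim(old)


if __name__ == "__main__":
    meta = AtomicWord(pack(0, 8))    # victim exposes 8 shared tasks
    for thief in range(5):
        print(thief, steal(meta))
```

Because every thief receives a distinct pre-increment value, each one computes a disjoint block without a separate discovery round trip or lock message. The cost, in this sketch, is that the queue owner must reset the word whenever it changes the shared portion of its queue, which corresponds to the extra local queue management the abstract mentions.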
