On the Merits of Distributed Work-Stealing on Selective Locality-Aware Tasks

Improving the performance of work-stealing load-balancing algorithms in distributed shared-memory systems is challenging. These algorithms must overcome the high costs of contention among workers, of communication and remote data references between nodes, and of the resulting impact on the locality preferences of tasks. Prior research focuses on stealing from the victim that best exploits data locality, and on using special deques that minimize contention between local and remote workers. This work explores a lesser-explored dimension of distributed work-stealing: the selection of tasks that are favourable for migration across nodes in a distributed-memory cluster. The selection of tasks is guided by application-level task locality rather than by hardware memory topology, as is the norm in the literature. The prototype used to evaluate these ideas is implemented in X10, a realization of the asynchronous partitioned global address space programming model. The evaluation demonstrates the applicability of this new approach to several real-world applications chosen from the Cowichan and Lonestar suites. On a cluster of 128 processors, the new work-stealing strategy achieves a speedup of between 12% and 31% over X10's existing scheduler. Moreover, the new strategy does not degrade the performance of any of the applications studied.
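The idea of selecting which tasks may migrate can be illustrated with a minimal sketch. This is not the X10 prototype described above. It assumes a hypothetical per-task flag, `locality_sensitive`, supplied by the application, and a hypothetical `Node` that keeps sensitive and flexible tasks in separate deques so that a remote thief only ever sees tasks that tolerate migration. All names are illustrative.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Task:
    name: str
    # Hypothetical application-level hint: True if the task's data
    # is tied to the node on which the task was spawned.
    locality_sensitive: bool


@dataclass
class Node:
    node_id: int
    # Tasks that should stay on this node (assumed design, not from the paper).
    local_deque: deque = field(default_factory=deque)
    # Tasks that may be migrated to another node.
    shared_deque: deque = field(default_factory=deque)

    def spawn(self, task: Task) -> None:
        # Route the task by its locality hint.
        target = self.local_deque if task.locality_sensitive else self.shared_deque
        target.append(task)

    def pop_local(self):
        # A local worker prefers locality-sensitive work, newest first,
        # and falls back to flexible work when none is left.
        if self.local_deque:
            return self.local_deque.pop()
        if self.shared_deque:
            return self.shared_deque.pop()
        return None

    def steal_remote(self):
        # A remote thief takes the oldest flexible task and never
        # touches the locality-sensitive deque.
        if self.shared_deque:
            return self.shared_deque.popleft()
        return None
```

In this sketch a remote steal can fail even when the victim still has work, which is the intended trade-off: locality-sensitive tasks stay where their data lives, at the cost of some idle time on the thief.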
