Topology-Aware and Dependence-Aware Scheduling and Memory Allocation for Task-Parallel Languages

We present a joint scheduling and memory allocation algorithm for efficient execution of task-parallel programs on non-uniform memory access (NUMA) systems. Task and data placement decisions are based on a static description of the memory hierarchy and on runtime information about inter-task communication. Existing locality-aware scheduling strategies for fine-grained tasks have strong limitations: they are specific to a class of machines or applications, they do not handle task dependences, they require manual program annotations, or they rely on fragile profiling schemes. By contrast, our solution makes no assumptions about the structure of programs or the layout of data in memory. Experimental results, based on the OpenStream language, show that the locality of main-memory accesses in scientific applications can be increased significantly on a 64-core machine, resulting in a speedup of up to 1.63× compared to a state-of-the-art work-stealing scheduler.
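
To make the idea of combining a static topology description with runtime communication information concrete, the following is a minimal illustrative sketch, not the algorithm evaluated in the paper. It assumes a hypothetical NUMA distance matrix `distance` (the static description) and a hypothetical per-task map `input_bytes_by_node` recording how many input bytes currently reside on each node (the runtime information). The function name `pick_node` and the cost model are assumptions made for illustration only.

```python
from typing import Dict, List


def pick_node(input_bytes_by_node: Dict[int, int],
              distance: List[List[int]]) -> int:
    """Return the node with the lowest distance-weighted input traffic.

    Hypothetical cost model: running the task on `candidate` costs the sum,
    over every node holding input data, of bytes held there multiplied by
    the distance from `candidate` to that node.
    """
    def cost(candidate: int) -> int:
        return sum(nbytes * distance[candidate][src]
                   for src, nbytes in input_bytes_by_node.items())

    # min() returns the lowest-numbered node when several costs are equal.
    return min(range(len(distance)), key=cost)


# Two-node machine: local accesses cost 10, remote accesses cost 20
# (assumed values, in the style of a NUMA distance table).
distance = [[10, 20],
            [20, 10]]

# Most of this task's input lives on node 0, so node 0 is chosen.
chosen = pick_node({0: 4096, 1: 1024}, distance)
```

Under these assumptions, a runtime could execute the task on a core of the chosen node and allocate the task's output buffers there as well, so that scheduling and memory placement are decided together.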
