Legion: Programming Distributed Heterogeneous Architectures with Logical Regions

This thesis covers the design and implementation of Legion, a new programming model and runtime system for targeting distributed heterogeneous machine architectures. Legion introduces logical regions as a new abstraction for describing the structure and usage of program data. We describe how logical regions provide a mechanism for applications to express important properties of program data, such as locality and independence, that are often ignored by current programming systems. We also show how logical regions allow programmers to scope the usage of program data by different computations. Because logical regions make these properties explicit, many of the challenging burdens of parallel programming, including dependence analysis and data movement, can be offloaded from the programmer to the programming system. Logical regions also improve the programmability and portability of applications by decoupling the specification of a program from how it is mapped onto a target architecture: they describe sets of program data abstractly, without requiring any specification of the placement or layout of that data. To control decisions about the placement of computations and data, we introduce a novel mapping interface that gives an application programmatic control over mapping decisions at runtime. Different implementations of the mapper interface can be used to port applications to new architectures and to explore alternative mapping choices. Legion guarantees that decisions made through the mapping interface cannot affect the correctness of the program, which makes it easy to port and tune applications for new architectures with different performance characteristics. Using the information provided by logical regions, an implementation of Legion
