Using the loop chain abstraction to schedule across loops in existing code

Exposing opportunities for parallelisation while explicitly managing data locality is the primary challenge in porting and optimising computational science simulation codes for performance. OpenMP provides mechanisms for expressing parallelism, but it remains the programmer’s responsibility to group computations to improve data locality. The loop chain abstraction, in which data access patterns are summarised in pragmas attached to parallel loops, gives compilers enough information to automate the trade-off between parallelism and data locality. We present the syntax and semantics of loop chain pragmas that describe the loops belonging to a loop chain and specify a high-level schedule for that chain. We show example usage of the pragmas, detail attempts to automate the transformation of a legacy scientific code written with specific language constraints into loop chain code, describe the compiler implementation of the loop chain pragmas, and report performance results for a computational fluid dynamics benchmark.
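To make the trade-off concrete, the following is a minimal sketch of the kind of transformation a loop chain schedule is meant to automate. It does not use the loop chain pragma syntax, which is not reproduced in this abstract; it uses only standard OpenMP `parallel for`. The two-loop stencil chain, the function names `chain_in_series` and `chain_fused_tiled`, and the tile size `TILE` are all hypothetical and chosen purely for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define TILE 64 /* hypothetical tile size; a real schedule would tune this */

/* Original schedule: two parallel loops executed in series.
 * The whole intermediate array b is written, then read back. */
void chain_in_series(long n, const double *a, double *b, double *c)
{
    assert(a != NULL && b != NULL && c != NULL);

    #pragma omp parallel for
    for (long i = 1; i < n - 1; i++)
        b[i] = (a[i - 1] + a[i] + a[i + 1]) / 3.0;

    #pragma omp parallel for
    for (long i = 2; i < n - 2; i++)
        c[i] = (b[i - 1] + b[i] + b[i + 1]) / 3.0;
}

/* Hand-written fused and tiled schedule with overlapping tiles.
 * Each tile recomputes the b values it needs into a small local buffer,
 * trading redundant work at tile edges for better data locality. */
void chain_fused_tiled(long n, const double *a, double *c)
{
    assert(a != NULL && c != NULL);

    #pragma omp parallel for
    for (long lo = 2; lo < n - 2; lo += TILE) {
        long hi = (lo + TILE < n - 2) ? lo + TILE : n - 2;
        double tmp[TILE + 2]; /* holds b[lo-1 .. hi] for this tile */

        /* First loop of the chain, restricted to what this tile reads. */
        for (long i = lo - 1; i <= hi; i++)
            tmp[i - (lo - 1)] = (a[i - 1] + a[i] + a[i + 1]) / 3.0;

        /* Second loop of the chain, consuming the buffered values. */
        for (long i = lo; i < hi; i++)
            c[i] = (tmp[i - lo] + tmp[i - lo + 1] + tmp[i - lo + 2]) / 3.0;
    }
}
```

Both functions write the same values to `c[2 .. n-3]`; the second keeps the intermediate data in a tile-sized buffer instead of a full array. Writing and maintaining such variants by hand is the burden that, according to the abstract, the access summaries in loop chain pragmas allow a compiler to take over.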
