Beyond Do Loops: Data Transfer Generation with Convex Array Regions

Automatic data transfer generation is a critical step in guided or automatic code generation for accelerators with distributed memories. Although good results have been achieved for loop nests, more complex control flow, such as switches or while loops, is generally not handled. This paper shows how to leverage the convex array region abstraction to generate data transfers. The scope of the study ranges from interprocedural analysis of simple loop nests containing function calls to inter-iteration data reuse optimization and arbitrary control flow in loop bodies. When an exact solution cannot be found, the generated transfers are approximated. Array regions are also used to extend redundant load-store elimination to array variables. The approach has been successfully applied to GPUs and domain-specific hardware accelerators.
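
To make the idea concrete, here is a minimal sketch of how a one-dimensional convex region could drive transfer generation. It is only an illustration under strong assumptions: regions are reduced to integer intervals rather than general polyhedra, and the names `Region`, `hull`, `loop_region`, and `transfer`, as well as the emitted `copy_to_accel` call, are hypothetical and not taken from the paper or from any tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """1-D convex region {array[phi] | lo <= phi <= hi}.

    exact=False marks an over-approximation (a MAY region):
    some elements in the interval might not be accessed.
    """
    array: str
    lo: int
    hi: int
    exact: bool = True


def hull(r1: Region, r2: Region) -> Region:
    """Convex hull of two regions on the same array.

    The hull stays exact only if both inputs are exact and the
    intervals overlap or touch, so that no untouched gap is added.
    """
    if r1.array != r2.array:
        raise ValueError("regions must refer to the same array")
    lo, hi = min(r1.lo, r2.lo), max(r1.hi, r2.hi)
    no_gap = max(r1.lo, r2.lo) <= min(r1.hi, r2.hi) + 1
    return Region(r1.array, lo, hi, r1.exact and r2.exact and no_gap)


def loop_region(array: str, offset: int, first: int, last: int,
                guarded: bool = False) -> Region:
    """Region touched by array[i + offset] for i in first..last (inclusive).

    A reference under a condition unknown at compile time (guarded=True)
    yields an approximate region.
    """
    return Region(array, first + offset, last + offset, exact=not guarded)


def transfer(region: Region, direction: str) -> str:
    """Emit a (hypothetical) copy call covering the whole region."""
    count = region.hi - region.lo + 1
    return f"copy_{direction}({region.array} + {region.lo}, {count})"


# for (i = 0; i < 10; i++) b[i] = a[i] + a[i + 2];
reads = hull(loop_region("a", 0, 0, 9), loop_region("a", 2, 0, 9))
print(reads, transfer(reads, "to_accel"))

# for (i = 0; i < 4; i++) if (c[i]) s += a[i + 10];
guarded = loop_region("a", 10, 0, 3, guarded=True)
print(guarded, transfer(guarded, "to_accel"))
```

In this sketch an approximate region still produces a valid transfer for data that is only read: copying more elements than strictly needed is safe, just potentially wasteful. Written regions need more care, since copying back elements that were never written on the accelerator could overwrite valid host data.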
