A framework for enhancing data reuse via associative reordering

The freedom to reorder computations involving associative operators has been widely recognized and exploited in the design of parallel algorithms and, to a more limited extent, in optimizing compilers. In this paper, we develop a novel framework that uses the associativity and commutativity of operations in regular loop computations to enhance register reuse. Stencils are an important class of computations to which the optimization framework can be applied to improve performance. We show how stencil operations can be implemented to better exploit register reuse and reduce loads and stores. We develop a multi-dimensional retiming formalism to characterize the space of valid implementations in conjunction with other program transformations. Experimental results demonstrate the effectiveness of the framework on a collection of high-order stencils.
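To make the idea of reordering a stencil's accumulation concrete, the following is a minimal sketch and not the paper's implementation. It assumes a 1D stencil with an odd number of weights and untouched boundary points. The function names `stencil_gather` and `stencil_scatter`, the `loads` counter, and the small list `acc` standing in for registers are all hypothetical and chosen only for illustration. The baseline reads every input once per term, while the reordered version reads each input once and adds its contribution to every pending partial sum, which is valid because addition is associative and commutative.

```python
def stencil_gather(a, w):
    """Baseline: each interior output gathers its 2r+1 inputs."""
    r = len(w) // 2
    n = len(a)
    out = [0] * n
    loads = 0  # counts reads of the input array (illustrative only)
    for i in range(r, n - r):
        total = 0
        for k in range(-r, r + 1):
            total += w[k + r] * a[i + k]
            loads += 1
        out[i] = total
    return out, loads


def stencil_scatter(a, w):
    """Reordered: each input is read once and scattered into partial sums.

    acc[t] holds the partial sum for output j - r + t while input j is
    processed; the small list stands in for a set of registers.
    """
    r = len(w) // 2
    n = len(a)
    out = [0] * n
    loads = 0
    acc = [0] * (2 * r + 1)
    for j in range(n):
        x = a[j]  # the only read of a[j]
        loads += 1
        for t in range(2 * r + 1):
            # Output i = j - r + t sees a[j] at offset k = r - t.
            acc[t] += w[2 * r - t] * x
        if j >= 2 * r:
            out[j - r] = acc[0]  # output j - r has now received all inputs
        acc = acc[1:] + [0]  # slide the window of partial sums
    return out, loads


if __name__ == "__main__":
    data = list(range(1, 11))
    weights = [1, 2, 3, 2, 1]
    g, g_loads = stencil_gather(data, weights)
    s, s_loads = stencil_scatter(data, weights)
    print(g == s, g_loads, s_loads)
```

The example uses integers so that both orders give identical results. With floating-point data the two orders can differ in rounding, so a comparison would need a tolerance.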
