Locality Optimization of Stencil Applications Using Data Dependency Graphs

This paper proposes tiling techniques based on data dependencies rather than on code structure. The work presented here leverages and extends previous work by the authors in the domain of non-traditional tiling for parallel applications. The main contributions of this paper are: (1) a formal description of tiling from the point of view of the data produced rather than of the source code; (2) a mathematical proof of an optimal tiling, in the sense of maximum data reuse, for stencil applications, addressing the disparity between computational power and memory bandwidth on many-core architectures; (3) a description and implementation of our tiling technique for well-known stencil applications; and (4) experimental evidence confirming that the proposed tiling alleviates this disparity. Our experiments, performed on one of the first Cyclops-64 many-core chips produced, show that our approach reduces both the total number of memory operations and the running time of stencil applications. A minimal, generic example of stencil tiling is sketched below.
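For readers unfamiliar with the basic idea, the following is a minimal sketch in C of a 1D three-point Jacobi stencil, first untiled and then spatially tiled. It is a generic illustration of loop tiling for stencils, not the data-dependency-graph-based tiling proposed in the paper; the array size N, time-step count T, and tile width B are arbitrary values assumed for the example.

/* Generic illustration of spatial tiling for a 1D 3-point Jacobi stencil.
 * NOT the paper's data-dependency-graph tiling; N, T, and B are assumed. */
#include <stdio.h>

#define N 1024   /* problem size (assumed for illustration) */
#define T 100    /* number of time steps */
#define B 64     /* tile width (assumed) */

static double a[N], b[N];

/* Untiled reference version: each time step sweeps the whole array,
 * so for large N every sweep streams the data from memory. */
static void jacobi_untiled(void) {
    for (int t = 0; t < T; t++) {
        for (int i = 1; i < N - 1; i++)
            b[i] = 0.25 * (a[i - 1] + 2.0 * a[i] + a[i + 1]);
        for (int i = 1; i < N - 1; i++)   /* copy back */
            a[i] = b[i];
    }
}

/* Spatially tiled version: the sweep is broken into tiles of width B
 * so that a tile's data can stay in fast on-chip memory while it is
 * updated, reducing traffic to off-chip memory. */
static void jacobi_tiled(void) {
    for (int t = 0; t < T; t++) {
        for (int ii = 1; ii < N - 1; ii += B) {
            int end = (ii + B < N - 1) ? ii + B : N - 1;
            for (int i = ii; i < end; i++)
                b[i] = 0.25 * (a[i - 1] + 2.0 * a[i] + a[i + 1]);
        }
        for (int i = 1; i < N - 1; i++)   /* copy back */
            a[i] = b[i];
    }
}

int main(void) {
    double ref;

    /* Run the untiled version and remember one result value. */
    for (int i = 0; i < N; i++) a[i] = (double)i;
    jacobi_untiled();
    ref = a[N / 2];

    /* Re-initialize and run the tiled version; both should agree. */
    for (int i = 0; i < N; i++) { a[i] = (double)i; b[i] = 0.0; }
    jacobi_tiled();

    printf("untiled a[N/2] = %f, tiled a[N/2] = %f\n", ref, a[N / 2]);
    return 0;
}

Both versions compute identical values; only the loop structure, and hence the memory access pattern, differs. The tiling proposed in the paper goes further by deriving the tile shapes from the data-dependency graph rather than from the loop nest itself.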
