Tiling optimizations for stencil computations

This thesis studies tiling optimizations for stencil programs. Research on tiling has traditionally focused on tessellating tilings, atomic tiles, and regular tile shapes. This thesis studies several novel tiling techniques that fall outside that scope. To represent general tiling schemes uniformly, a unified tiling representation framework is introduced, and three techniques are studied within it. The first, Hierarchical Overlapped Tiling, reduces communication overhead by introducing redundant computation. It also applies hierarchical tiling to exploit the hardware hierarchy, so that the additional overhead of the redundant computation is minimized. The second, Conjugate-Trapezoid Tiling, interleaves the computations and communications within a tile so that computation time overlaps communication latency. The resulting pipeline of computations and communications hides the communication latency. Third, this thesis studies the tile shape selection problem for hierarchical tiling and concludes that optimal tile shape selection is a multidimensional, nonlinear, bi-level programming problem. Experimental results show that the irregular tile shapes selected by solving this optimization problem have the potential to outperform intuitive tile shapes.
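The idea behind overlapped tiling, trading redundant computation for less communication, can be illustrated with a minimal sketch. It is not the thesis's implementation: it assumes a 1-D three-point averaging stencil with fixed end cells, a single tiling level, and hypothetical function names (`step`, `reference`, `overlapped_tiling`). Each tile loads a halo as wide as the number of time steps, advances all steps without exchanging data with its neighbours, and keeps only its own interior.

```python
def step(u):
    """One Jacobi sweep of a 3-point averaging stencil; end cells are held fixed."""
    v = list(u)
    for i in range(1, len(u) - 1):
        v[i] = (u[i - 1] + u[i] + u[i + 1]) / 3.0
    return v


def reference(u, steps):
    """Untiled baseline: sweep the whole domain `steps` times."""
    for _ in range(steps):
        u = step(u)
    return u


def overlapped_tiling(u, steps, tile):
    """Process each tile independently over all time steps (illustrative sketch).

    Every tile is widened by a halo of `steps` cells per side, clipped to the
    domain. Halo cells are recomputed redundantly by neighbouring tiles, so no
    data is exchanged between time steps.
    """
    n = len(u)
    out = [0.0] * n
    for lo in range(0, n, tile):
        hi = min(lo + tile, n)
        glo = max(lo - steps, 0)   # left edge of the widened tile
        ghi = min(hi + steps, n)   # right edge of the widened tile
        local = u[glo:ghi]
        for _ in range(steps):
            local = step(local)
        # Stale values creep inward one cell per step from an artificial edge,
        # so after `steps` sweeps only the halo is contaminated.
        out[lo:hi] = local[lo - glo:hi - glo]
    return out


if __name__ == "__main__":
    grid = [float(i * i % 7) for i in range(32)]
    tiled = overlapped_tiling(grid, steps=3, tile=8)
    base = reference(grid, steps=3)
    print(max(abs(a - b) for a, b in zip(tiled, base)))
```

In this sketch the redundant work grows with the halo width, which is the overhead that a hierarchical scheme aims to keep small.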
