Paper information - Tiling as a Durable Abstraction for Parallelism and Data Locality

Tiling as a Durable Abstraction for Parallelism and Data Locality

Didem Unat, Cy Chan, Weiqun Zhang, John Bell, John Shalf
Lawrence Berkeley National Laboratory, 1 Cyclotron Rd, Berkeley, California, USA 94720
{dunat, cychan, weiqunzhang, jbbell, jshalf}@lbl.gov

Abstract: Tiling is a useful loop transformation for expressing parallelism and data locality. Automated tiling transformations that preserve data locality are increasingly important due to hardware trends towards massive parallelism and the increasing cost of data movement relative to the cost of computing. We propose TiDA as a durable tiling abstraction that centralizes parameterized tiling information within array data types with minimal changes to the source code. The data layout information can be used by the compiler and runtime to automatically manage parallelism, optimize data locality, and schedule tasks intelligently. In this paper, we present the design features and early interface of TiDA along with some preliminary results.

I. INTRODUCTION

There are two main trends in computer architecture that legitimately concern application developers. First, exponential increases in raw parallelism have replaced nearly two decades of clock-rate improvements in microprocessors. From now on, applications must rely extensively on explicit fine-grained parallelism as a main source of performance improvement. Second, the energy cost of moving data is not improving as fast as the energy required for computation. Data movement is expected to become the leading contributor to the power consumption and cost of future machines [1]. Whereas current programming environments were designed to assume modest growth in parallelism, uniform communication costs, and that FLOPs are the most expensive resource (often saving them at the expense of data movement), the future of computing hinges on preserving data locality (sometimes at the expense of FLOPs) and minimizing data movement.
In order to minimize data movement, applications have to be optimized for both vertical and horizontal data movement. Vertical data movement concerns the management of data through the memory hierarchy, from memory to processing units, and has to be tuned to increase data reuse in on-chip memory. Horizontal data movement concerns managing locality in the presence of non-uniform bandwidth and latency to on-chip memory. NUMA (non-uniform memory access) issues are already prevalent for on-chip data movement and will be more conspicuous on 1000-core chips, leading to serious performance consequences.

To address the programming challenges that result from these trends in computer architecture, programming models play a crucial role in abstracting the complexity for programmers. Current programming models assume equal cost for all data accesses and rely on the cache to virtualize data movement, which does not reflect the reality of the computer architecture. Thus, application developers need a richer interface to express the parallelism and data locality requirements of an algorithm.

Tiling is a loop transformation that has proven useful for exploiting parallelism and enhancing data locality. Despite the extensive literature on this optimization [2]–[9], there is no standard automated solution for transferring tiling information to the compiler and runtime system. Most current methods rely on static loop transformations (usually in source-to-source translation or in the compiler intermediate representation) and do not allow the runtime system to take part in decisions about tiling transformations using dynamic data. The status quo is inadequate for modern adaptive codes such as Adaptive Mesh Refinement (AMR), where crucial information about optimizing data locality is only available at runtime and changes during execution. We argue that tiling should be decoupled from the loops and elevated to the programming model for better interaction with the compiler and runtime system.
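As a point of reference, the static loop tiling discussed above can be sketched as follows. This is a minimal illustration and not code from the paper or from TiDA: the 5-point stencil, the function names, and the tile size of 4 are all assumptions chosen for the example. It shows how the tile size ends up baked into the loop nest itself, which is the coupling the authors argue against.

```python
# Illustrative sketch only: classic static loop tiling of a 2D 5-point stencil.
# The function names and the default tile size are assumptions for this example.

def stencil_untiled(src, n):
    """Apply a 5-point average to the interior of an n x n grid."""
    dst = [row[:] for row in src]
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            dst[i][j] = 0.2 * (src[i][j] + src[i - 1][j] + src[i + 1][j]
                               + src[i][j - 1] + src[i][j + 1])
    return dst


def stencil_tiled(src, n, tile=4):
    """Same computation, with the loop nest split into tile x tile blocks."""
    dst = [row[:] for row in src]
    for ii in range(1, n - 1, tile):          # outer loops walk over tiles
        for jj in range(1, n - 1, tile):
            i_end = min(ii + tile, n - 1)     # clip partial tiles at the edge
            j_end = min(jj + tile, n - 1)
            for i in range(ii, i_end):        # inner loops stay within one tile
                for j in range(jj, j_end):
                    dst[i][j] = 0.2 * (src[i][j] + src[i - 1][j] + src[i + 1][j]
                                       + src[i][j - 1] + src[i][j + 1])
    return dst


n = 10
grid = [[float(i * n + j) for j in range(n)] for i in range(n)]
# Tiling only reorders the iteration space, so both versions must agree.
same = stencil_tiled(grid, n, tile=4) == stencil_untiled(grid, n)
```

Because each output point reads only from `src`, the tiles are independent and could be handed to different workers. Changing the tile size here means editing or re-parameterizing the loop nest, whereas the abstraction proposed in the paper keeps that information with the array data type.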
A tiling formulation supported as a language construct can expose massive degrees of parallelism through domain decomposition because a tile represents an atomic unit of work, which makes it far easier for the runtime to schedule tasks. Automating the scheduling decisions enables the runtime system to hide the complexity of massive growth in on-chip parallelism from application developers. Moreover, tiles represent the core concept for data locality because vertical locality can be achieved by hierarchically partitioning the domain and selecting the appropriate tile size at each level. Horizontal locality can be achieved by respecting tile topology and co-locating tiles that share data closer to each other when data is mapped to execution units. This formulation naturally allows multi-level parallelism because coarse-grained parallelism can be expressed across tiles and fine-grained parallelism can be introduced in the form of vectorization and instruction ordering within a tile.

Although the immediate application of this approach targets data parallel or bulk synchronous stencil operations, the atomic nature of the tiling abstraction also makes it amenable to future work on asynchronous runtime systems. We envision a programming model of the future that is neither purely bulk synchronous nor purely asynchronous parallel, since neither approach is perfect for every situation. Our vision for a future programming model embeds data parallel units within task containers, where the data parallel unit focuses on expression of hierarchy and topology with the tiling abstraction and the task parallel unit focuses on functional partitioning, tile mapping and scheduling.

In this paper, we introduce TiDA as a durable tiling abstraction for data parallelism for the programming model

John Shalf | Didem Unat | Weiqun Zhang | John B. Bell | Cy Chan

[1] Robert W. Numrich, et al. Co-array Fortran for parallel programming, 1998, FORF.

[2] Bradford L. Chamberlain, et al. Parallel Programmability and the Chapel Language, 2007, Int. J. High Perform. Comput. Appl.

[3] Sriram Krishnamoorthy, et al. Parametric multi-level tiling of imperfectly nested loops, 2009, ICS.

[4] Zhiyuan Li, et al. New tiling techniques to improve cache temporal locality, 1999, PLDI '99.

[5] Michael Wolfe, et al. More iteration space tiling, 1989, Proceedings of the 1989 ACM/IEEE Conference on Supercomputing (Supercomputing '89).

[6] Larry Carter, et al. Selecting tile shape for minimal execution time, 1999, SPAA '99.

[7] Katherine A. Yelick, et al. Titanium: A High-performance Java Dialect, 1998, Concurr. Pract. Exp.

[8] J. Ramanujam, et al. Parameterized tiling revisited, 2010, CGO '10.

[9] Sanjay V. Rajopadhye, et al. Parameterized Tiling for Imperfectly Nested Loops, 2009.

[10] Chau-Wen Tseng, et al. Tiling Optimizations for 3D Scientific Computations, 2000, ACM/IEEE SC 2000 Conference (SC'00).

[11] Sanjay V. Rajopadhye, et al. Parameterized tiled loops for free, 2007, PLDI '07.

[12] Nectarios Koziris, et al. Automatic parallel code generation for tiled nested loops, 2004, SAC '04.

[13] P. Hanrahan, et al. Sequoia: Programming the Memory Hierarchy, 2006, ACM/IEEE SC 2006 Conference (SC'06).

[14] Sanjay V. Rajopadhye, et al. Multi-level tiling: M for the price of one, 2007, Proceedings of the 2007 ACM/IEEE Conference on Supercomputing (SC '07).

[15] David A. Padua, et al. Programming for parallelism and locality with hierarchically tiled arrays, 2006, PPoPP '06.

[16] Alexander Aiken, et al. Legion: Expressing locality and independence with logical regions, 2012, 2012 International Conference for High Performance Computing, Networking, Storage and Analysis.

[17] Vivek Sarkar, et al. X10: an object-oriented approach to non-uniform cluster computing, 2005, OOPSLA '05.

[18] Ken Kennedy, et al. Compiler blockability of numerical algorithms, 1992, Proceedings Supercomputing '92.

[19] Uday Bondhugula, et al. A practical automatic polyhedral parallelizer and locality optimizer, 2008, PLDI '08.

[20] Scott B. Baden, et al. Mint: realizing CUDA performance in 3D stencil methods with annotated C, 2011, ICS '11.

[21] Chun Chen, et al. Loop Transformation Recipes for Code Generation and Auto-Tuning, 2009, LCPC.

[22] Kathryn S. McKinley, et al. Tile size selection using cache organization and data layout, 1995, PLDI '95.