A graph based framework to detect optimal memory layouts for improving data locality

In order to extract high levels of performance from modern parallel architectures, the effective management of deep memory hierarchies is very important. While architectural advances in caches help in better utilization of the memory hierarchy, compiler-directed locality enhancement techniques are also important. In this paper we propose a locality improvement technique that uses data space (array layout) transformations in contrast to most of the previous work based on iteration space (loop) transformations. In other words, rather than changing the order of loop iterations, our technique modifies the memory layouts of multi-dimensional arrays. In comparison with previous work on data transformations it brings two novelties. First, we formulate the problem on a special graph structure called the layout graph (LG) and use integer linear programming (ILP) methods to determine optimal layouts. Second, in addition to static layout detection, our approach also enables the compiler to determine optimal dynamic layouts; that is, the layouts that can be changed across loop nest boundaries. We believe that this is the first attempt to determine optimal dynamic memory layouts. We also present preliminary experimental results on the SGI Origin 2000 distributed shared memory multiprocessor. Our results so far are encouraging and indicate that the additional compilation time taken by the solver is tolerable.

[1]  Mahmut T. Kandemir,et al.  A compiler algorithm for optimizing locality in loop nests , 1997, ICS '97.

[2]  E. Ayguade,et al.  A Novel Approach Towards Automatic Data Distribution , 1995, Proceedings of the IEEE/ACM SC95 Conference.

[3]  John Zahorjan,et al.  Optimizing Data Locality by Array Restructuring , 1995 .

[4]  Michael E. Wolf,et al.  The cache performance and optimizations of blocked algorithms , 1991, ASPLOS IV.

[5]  Vivek Sarkar,et al.  Locality Analysis for Distributed Shared-Memory Multiprocessors , 1996, LCPC.

[6]  Chau-Wen Tseng,et al.  Improving data locality with loop transformations , 1996, TOPL.

[7]  Wei Li,et al.  Compiling for NUMA Parallel Machines , 1993 .

[8]  Dennis Gannon,et al.  Strategies for cache and local memory management by global program transformation , 1988, J. Parallel Distributed Comput..

[9]  Michael Wolfe,et al.  High performance compilers for parallel computing , 1995 .

[10]  Laurence A. Wolsey,et al.  Integer and Combinatorial Optimization , 1988 .

[11]  Tarek S. Abdelrahman,et al.  Fusion of Loops for Parallelism and Locality , 1997, IEEE Trans. Parallel Distributed Syst..

[12]  Wei Li,et al.  Unifying data and control transformations for distributed shared-memory machines , 1995, PLDI '95.

[13]  Michael F. P. O'Boyle,et al.  Non-singular data transformations: definition, validity and applications , 1997, ICS '97.

[14]  Bowen Alpern,et al.  Hierarchical Tiling: A Methodology for High Performance , 1996 .

[15]  Laurence A. Wolsey,et al.  Integer and Combinatorial Optimization , 1988, Wiley interscience series in discrete mathematics and optimization.

[16]  Monica S. Lam,et al.  A data locality optimizing algorithm , 1991, PLDI '91.

[17]  Mahmut T. Kandemir,et al.  A matrix-based approach to the global locality optimization problem , 1998, Proceedings. 1998 International Conference on Parallel Architectures and Compilation Techniques (Cat. No.98EX192).

[18]  William Pugh,et al.  The Omega Library interface guide , 1995 .

[19]  Kathryn S. McKinley,et al.  Tile size selection using cache organization and data layout , 1995, PLDI '95.

[20]  Keshav Pingali,et al.  Data-centric multi-level blocking , 1997, PLDI '97.

[21]  Mahmut T. Kandemir,et al.  A hyperplane based approach for optimizing spatial locality in loop nests , 1998, ICS '98.

[22]  Michael F. P. O'Boyle,et al.  Integrating loop and data transformations for global optimisation , 1998, Proceedings. 1998 International Conference on Parallel Architectures and Compilation Techniques (Cat. No.98EX192).

[23]  Monica S. Lam,et al.  Data and computation transformations for multiprocessors , 1995, PPOPP '95.