An Approach for Semiautomatic Locality Optimizations Using OpenMP

The processing power of multicore CPUs increases at a high rate, whereas memory bandwidth is falling behind. Almost all modern processors use multiple cache levels to hide the latency of slow main memory; however, cache efficiency depends directly on data locality. This paper studies a possible way to expose data locality through the syntax of the parallel programming system OpenMP. We study data locality optimizations on two applications: matrix multiplication and a Gauss-Seidel stencil. We show that only small changes to OpenMP are required to expose data locality so that a compiler can transform the code. Our notion of tiled loops allows developers to describe data locality easily, even in scenarios with non-trivial data dependencies. Furthermore, we describe two optimization techniques. The first explicitly uses a form of local memory to prevent conflict cache misses, whereas the second modifies the wavefront parallel programming pattern with dynamically sized blocks to increase the number of parallel tasks. As an additional contribution, we explore the benefit of using multiple levels of tiling.
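
To make the idea of loop tiling concrete, the following is a minimal sketch of a hand-tiled matrix multiplication using only standard OpenMP directives. It is not the syntax proposed in this paper; it illustrates the kind of manual transformation that a tiled-loop notation would let a compiler generate. The function name `matmul_tiled`, the helper `min_size`, and the parameter `tile` are assumptions introduced for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns the smaller of two sizes. */
static size_t min_size(size_t a, size_t b) { return a < b ? a : b; }

/*
 * Computes C += A * B for row-major n x n matrices.
 * The three loops are split into tiles of edge length `tile` (an assumed
 * tuning parameter) so that each tile's working set can stay in cache.
 */
void matmul_tiled(size_t n, size_t tile,
                  const double *A, const double *B, double *C)
{
    assert(tile > 0);

    /* Different (ii, jj) tiles write disjoint blocks of C, so the two
       outer tile loops can be distributed across threads safely. */
    #pragma omp parallel for collapse(2) schedule(static)
    for (size_t ii = 0; ii < n; ii += tile) {
        for (size_t jj = 0; jj < n; jj += tile) {
            /* The kk tile loop stays sequential within a thread because
               every kk step accumulates into the same block of C. */
            for (size_t kk = 0; kk < n; kk += tile) {
                size_t i_end = min_size(ii + tile, n);
                size_t j_end = min_size(jj + tile, n);
                size_t k_end = min_size(kk + tile, n);
                for (size_t i = ii; i < i_end; ++i) {
                    for (size_t k = kk; k < k_end; ++k) {
                        double a = A[i * n + k];
                        for (size_t j = jj; j < j_end; ++j) {
                            C[i * n + j] += a * B[k * n + j];
                        }
                    }
                }
            }
        }
    }
}
```

The tile size is left as a runtime parameter because a good value depends on the cache sizes of the target machine; without OpenMP support enabled, the pragma is ignored and the code runs sequentially with the same result.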
