User-Specified and Automatic Data Layout Selection for Portable Performance

This paper describes a new approach to managing array data layouts to optimize performance for scientific codes. Prior research has shown that changing data layouts (e.g., interleaving arrays) can improve performance. However, two major obstacles have kept such optimizations from being widely used: (1) the need to select different layouts for different computing platforms, and (2) the cost of rewriting codes to use new layouts. We describe a source-to-source translation process that generates codes with different array interleavings from a data layout specification. We used this process to generate 19 different data layouts for an ASC benchmark code (IRSmk) and 32 different data layouts for the DARPA UHPC challenge application (LULESH). Performance results for multicore versions of the benchmarks with different layouts show significant benefits on four computing platforms (IBM POWER7, AMD APU, Intel Sandybridge, IBM BG/Q). For IRSmk, performance improvements range from 22.23× on IBM POWER7 to 1.10× on Intel Sandybridge. For LULESH, improvements range from 1.82× on IBM POWER7 to 1.02× on Intel Sandybridge. We also developed a new optimization algorithm that recommends a layout for an input source program and specific target machine characteristics. Our results show that this automated layout algorithm outperforms the manual layouts in one case and performs within 10% of the best architecture-specific layout in all but one of the remaining cases.
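To make the idea of array interleaving concrete, the following is a minimal sketch of how a layout specification could determine where each array element lives in memory. It is an illustration only, not the specification format or translator described in the paper: the list-of-groups representation and the names `build_offsets` and `flat_index` are hypothetical. Each group of array names is assumed to be stored as consecutive records, with one element from every array in the group per record.

```python
from typing import Dict, List, Tuple

# Hypothetical layout entry: (base offset, record width, field position).
Entry = Tuple[int, int, int]


def build_offsets(layout: List[List[str]], n: int) -> Dict[str, Entry]:
    """Assign every array in `layout` a place in one flat buffer.

    `layout` is a list of groups; arrays in the same group are interleaved.
    Every array is assumed to have exactly `n` elements.
    """
    table: Dict[str, Entry] = {}
    base = 0
    for group in layout:
        width = len(group)  # elements per interleaved record
        for field, name in enumerate(group):
            table[name] = (base, width, field)
        base += width * n  # next group starts after this group's records
    return table


def flat_index(table: Dict[str, Entry], name: str, i: int) -> int:
    """Translate the logical access name[i] into an offset in the flat buffer."""
    base, width, field = table[name]
    return base + i * width + field


# Two candidate layouts for the same three arrays of length 4.
separate = build_offsets([["x"], ["y"], ["z"]], n=4)   # no interleaving
grouped = build_offsets([["x", "y"], ["z"]], n=4)      # x and y interleaved

# In the grouped layout, x[3] and y[3] are adjacent in memory.
x3 = flat_index(grouped, "x", 3)  # 0 + 3*2 + 0 = 6
y3 = flat_index(grouped, "y", 3)  # 0 + 3*2 + 1 = 7
```

Because the source code only ever refers to the logical access `name[i]`, switching between the `separate` and `grouped` layouts changes the index mapping without touching the computation, which is the property that makes per-platform layout selection practical.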
