TiDA: High-Level Programming Abstractions for Data Locality Management

The high energy cost of data movement relative to computation makes data locality management in programs critically important. Managing data locality manually is not a trivial task, and it complicates programming. Tiling is a well-known approach that provides both data locality and parallelism in an application. However, there is no standard programming construct for expressing tiling at the application level. We have developed a multicore programming model, TiDA, based on tiling, and implemented the model as C++ and Fortran libraries. The proposed programming model has three high-level abstractions: tiles, regions, and tile iterators. These abstractions hide the details of data decomposition, cache locality optimizations, and memory affinity management from the application. In this paper we unveil the internals of the library and demonstrate the performance and programmability advantages of the model on five applications on multiple NUMA nodes. The library achieves up to 2.10x speedup over OpenMP on a single compute node for simple kernels, and up to 22x improvement over a single thread for a more complex combustion proxy application (SMC) on 24 cores. The MPI+TiDA implementation of geometric multigrid demonstrates a 30.9% performance improvement over MPI+OpenMP when scaling to 3072 cores (excluding MPI communication overheads; 8.5% otherwise).
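To make the idea of tiles and tile iteration concrete, the following is a minimal C++ sketch of decomposing a 2D region into tiles and applying a stencil kernel one tile at a time. It is not TiDA's API: the names `Tile`, `make_tiles`, `smooth_tiled`, and `smooth_reference` are hypothetical and chosen only for illustration, and the sketch ignores memory affinity and threading entirely.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical type for illustration; not a TiDA class.
struct Tile {
    std::size_t i0, i1;  // row range [i0, i1)
    std::size_t j0, j1;  // column range [j0, j1)
};

// Decompose an ny-by-nx region into tiles of at most ty-by-tx cells.
std::vector<Tile> make_tiles(std::size_t ny, std::size_t nx,
                             std::size_t ty, std::size_t tx) {
    assert(ty > 0 && tx > 0);
    std::vector<Tile> tiles;
    for (std::size_t i = 0; i < ny; i += ty)
        for (std::size_t j = 0; j < nx; j += tx)
            tiles.push_back({i, std::min(i + ty, ny), j, std::min(j + tx, nx)});
    return tiles;
}

// 5-point average at cell (i, j) of a row-major grid with nx columns.
inline double stencil_at(const std::vector<double>& in, std::size_t nx,
                         std::size_t i, std::size_t j) {
    return 0.2 * (in[i * nx + j] + in[(i - 1) * nx + j] + in[(i + 1) * nx + j] +
                  in[i * nx + j - 1] + in[i * nx + j + 1]);
}

// Tiled traversal: visit the grid tile by tile, updating interior cells only.
std::vector<double> smooth_tiled(const std::vector<double>& in,
                                 std::size_t ny, std::size_t nx,
                                 std::size_t ty, std::size_t tx) {
    std::vector<double> out = in;  // boundary cells keep their input values
    if (ny < 3 || nx < 3) return out;
    for (const Tile& t : make_tiles(ny, nx, ty, tx)) {
        // Clip the tile to the interior so the stencil never reads out of bounds.
        const std::size_t ib = std::max<std::size_t>(t.i0, 1);
        const std::size_t ie = std::min(t.i1, ny - 1);
        const std::size_t jb = std::max<std::size_t>(t.j0, 1);
        const std::size_t je = std::min(t.j1, nx - 1);
        for (std::size_t i = ib; i < ie; ++i)
            for (std::size_t j = jb; j < je; ++j)
                out[i * nx + j] = stencil_at(in, nx, i, j);
    }
    return out;
}

// Untiled reference traversal, used to check the tiled version.
std::vector<double> smooth_reference(const std::vector<double>& in,
                                     std::size_t ny, std::size_t nx) {
    std::vector<double> out = in;
    if (ny < 3 || nx < 3) return out;
    for (std::size_t i = 1; i + 1 < ny; ++i)
        for (std::size_t j = 1; j + 1 < nx; ++j)
            out[i * nx + j] = stencil_at(in, nx, i, j);
    return out;
}
```

Because each tile writes a disjoint set of output cells and reads only from the input array, the loop over tiles in this sketch could in principle be distributed across threads. A library built around such abstractions would additionally decide tile sizes and data placement, which this example leaves out.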
