Paper information - Manycore performance-portability: Kokkos multidimensional array library

Manycore performance-portability: Kokkos multidimensional array library

Large, complex scientific and engineering application codes have a significant investment in computational kernels that implement their mathematical models. Porting these computational kernels to the collection of modern manycore accelerator devices is a major challenge because these devices have diverse programming models, application programming interfaces (APIs), and performance requirements. The Kokkos Array programming model provides a library-based approach to implementing computational kernels that are performance-portable to CPU-multicore and GPGPU accelerator devices. This programming model is based upon three fundamental concepts: (1) manycore compute devices, each with its own memory space, (2) data parallel kernels, and (3) multidimensional arrays. Kernel execution performance is, especially for NVIDIA® devices, extremely dependent on data access patterns. The optimal data access pattern can differ between manycore devices, potentially leading to different implementations of computational kernels specialized for different devices. The Kokkos Array programming model supports performance-portable kernels by (1) separating data access patterns from computational kernels through a multidimensional array API and (2) introducing device-specific data access mappings when a kernel is compiled. An implementation of Kokkos Array is available through Trilinos [Trilinos website, http://trilinos.sandia.gov/, August 2011].
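The separation of data access patterns from kernels described in the abstract can be illustrated with a minimal C++ sketch. It is not the Kokkos Array API: the names `RowMajor`, `ColumnMajor`, `Array2D`, `RowSum`, `parallel_for_serial`, and `row_sums` are hypothetical, and the serial loop is only a stand-in for a device-specific parallel dispatch. The assumption is that a layout policy chosen at compile time maps the indices `(i, j)` to a memory offset, so the kernel source stays the same while the memory layout changes.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical layout policies (not the Kokkos API).
// Each one maps a 2D index (i, j) to an offset in flat storage.
struct RowMajor {
    // j varies fastest: one work item reads its row contiguously.
    static std::size_t offset(std::size_t i, std::size_t j,
                              std::size_t /*n0*/, std::size_t n1) {
        return i * n1 + j;
    }
};

struct ColumnMajor {
    // i varies fastest: neighbouring work items touch neighbouring addresses.
    static std::size_t offset(std::size_t i, std::size_t j,
                              std::size_t n0, std::size_t /*n1*/) {
        return j * n0 + i;
    }
};

// Hypothetical 2D array whose index mapping is fixed at compile time.
template <typename T, typename Layout>
class Array2D {
public:
    Array2D(std::size_t n0, std::size_t n1)
        : n0_(n0), n1_(n1), data_(n0 * n1) {}

    T& operator()(std::size_t i, std::size_t j) {
        return data_[Layout::offset(i, j, n0_, n1_)];
    }
    const T& operator()(std::size_t i, std::size_t j) const {
        return data_[Layout::offset(i, j, n0_, n1_)];
    }

    std::size_t extent0() const { return n0_; }
    std::size_t extent1() const { return n1_; }
    const std::vector<T>& raw() const { return data_; }

private:
    std::size_t n0_, n1_;
    std::vector<T> data_;
};

// Data parallel kernel written once against the (i, j) interface.
// Work item i sums row i; the kernel never computes a memory offset.
template <typename Array>
struct RowSum {
    const Array& in;
    std::vector<double>& out;

    void operator()(std::size_t i) const {
        double sum = 0.0;
        for (std::size_t j = 0; j < in.extent1(); ++j) {
            sum += in(i, j);
        }
        out[i] = sum;
    }
};

// Serial stand-in for a device dispatch (assumption: a real backend
// would run the work items in parallel on the target device).
template <typename Functor>
void parallel_for_serial(std::size_t n, const Functor& f) {
    for (std::size_t i = 0; i < n; ++i) {
        f(i);
    }
}

// Fills an n0 x n1 array with a(i, j) = 10 * i + j and returns the row sums.
template <typename Layout>
std::vector<double> row_sums(std::size_t n0, std::size_t n1) {
    Array2D<double, Layout> a(n0, n1);
    for (std::size_t i = 0; i < n0; ++i) {
        for (std::size_t j = 0; j < n1; ++j) {
            a(i, j) = 10.0 * static_cast<double>(i) + static_cast<double>(j);
        }
    }
    std::vector<double> out(n0, 0.0);
    parallel_for_serial(n0, RowSum<Array2D<double, Layout>>{a, out});
    return out;
}
```

Calling `row_sums<RowMajor>(2, 3)` and `row_sums<ColumnMajor>(2, 3)` runs the same kernel source over two different memory layouts and yields the same sums. Only the template argument changes, which is the role the abstract assigns to device-specific data access mappings chosen at compile time.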

Daniel Sunderland | H. Carter Edwards | Vicki L. Porter | Chris Amsler | Sam Mish
