Large, complex scientific and engineering application code have a significant investment in computational kernels to implement their mathematical models. Porting these computational kernels to the collection of modern manycore accelerator devices is a major challenge in that these devices have diverse programming models, application programming interfaces (APIs), and performance requirements. The Kokkos Array programming model provides library-based approach to implement computational kernels that are performance-portable to CPU-multicore and GPGPU accelerator devices. This programming model is based upon three fundamental concepts: (1) manycore compute devices each with its own memory space, (2) data parallel kernels and (3) multidimensional arrays. Kernel execution performance is, especially for NVIDIA® devices, extremely dependent on data access patterns. Optimal data access pattern can be different for different manycore devices – potentially leading to different implementations of computational kernels specialized for different devices. The Kokkos Array programming model supports performance-portable kernels by (1) separating data access patterns from computational kernels through a multidimensional array API and (2) introduce device-specific data access mappings when a kernel is compiled. An implementation of Kokkos Array is available through Trilinos [Trilinos website, http://trilinos.sandia.gov/, August 2011].
[1]
Gaël Varoquaux,et al.
The NumPy Array: A Structure for Efficient Numerical Computation
,
2011,
Computing in Science & Engineering.
[2]
David Abrahams,et al.
C++ Template Metaprogramming: Concepts, Tools, and Techniques from Boost and Beyond (C++ In-Depth Series)
,
2004
.
[3]
James Reinders,et al.
Intel® threading building blocks
,
2008
.
[4]
B. J. Epstein.
List of Tables xi
,
2012
.
[5]
Todd L. Veldhuizen,et al.
Arrays in Blitz++
,
1998,
ISCOPE.