Large, complex scientific and engineering application codes have a significant investment in the computational kernels that implement their mathematical models. Porting these computational kernels to the collection of modern manycore accelerator devices is a major challenge because these devices have diverse programming models, application programming interfaces (APIs), and performance requirements. The Kokkos Array programming model provides a library-based approach to implementing computational kernels that are performance-portable to CPU-multicore and GPGPU accelerator devices. This programming model is based upon three fundamental concepts: (1) manycore compute devices, each with its own memory space; (2) data parallel kernels; and (3) multidimensional arrays. Kernel execution performance is extremely dependent on data access patterns, especially for NVIDIA® devices. The optimal data access pattern can differ between manycore devices, potentially leading to different implementations of computational kernels specialized for different devices. The Kokkos Array programming model supports performance-portable kernels by (1) separating data access patterns from computational kernels through a multidimensional array API and (2) introducing device-specific data access mappings when a kernel is compiled. An implementation of Kokkos Array is available through Trilinos [Trilinos website, http://trilinos.sandia.gov/, August 2011].
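The separation of data access patterns from kernel code can be illustrated with a minimal C++ sketch. It is not the Kokkos Array API: the names `LayoutRight`, `LayoutLeft`, `Array2D`, `fill`, and `row_sum` are hypothetical and chosen only for illustration, and the serial loops stand in for a data parallel dispatch. The sketch assumes a two-dimensional array whose index-to-memory mapping is a compile-time template parameter, so the same kernel source can be compiled against different mappings.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical layout policies (illustrative, not the Kokkos Array API).
// Each maps a 2D index (i, j) to an offset in a flat buffer.
struct LayoutRight {  // row-major: j is the fastest-varying index
  static std::size_t offset(std::size_t i, std::size_t j,
                            std::size_t /*n0*/, std::size_t n1) {
    return i * n1 + j;
  }
};

struct LayoutLeft {  // column-major: i is the fastest-varying index
  static std::size_t offset(std::size_t i, std::size_t j,
                            std::size_t n0, std::size_t /*n1*/) {
    return i + j * n0;
  }
};

// Hypothetical 2D array: kernels use operator()(i, j) and never see
// the underlying memory order, which is fixed at compile time by Layout.
template <class T, class Layout>
class Array2D {
 public:
  Array2D(std::size_t n0, std::size_t n1)
      : n0_(n0), n1_(n1), data_(n0 * n1) {}

  T& operator()(std::size_t i, std::size_t j) {
    assert(i < n0_ && j < n1_);
    return data_[Layout::offset(i, j, n0_, n1_)];
  }
  const T& operator()(std::size_t i, std::size_t j) const {
    assert(i < n0_ && j < n1_);
    return data_[Layout::offset(i, j, n0_, n1_)];
  }

  std::size_t extent0() const { return n0_; }
  std::size_t extent1() const { return n1_; }
  const std::vector<T>& raw() const { return data_; }  // for inspection only

 private:
  std::size_t n0_;
  std::size_t n1_;
  std::vector<T> data_;
};

// Kernel written once against the array API: a(i, j) = 10 * i + j.
template <class Layout>
void fill(Array2D<double, Layout>& a) {
  for (std::size_t i = 0; i < a.extent0(); ++i)
    for (std::size_t j = 0; j < a.extent1(); ++j)
      a(i, j) = 10.0 * static_cast<double>(i) + static_cast<double>(j);
}

// Second kernel, also layout-agnostic: sum of row i.
template <class Layout>
double row_sum(const Array2D<double, Layout>& a, std::size_t i) {
  double sum = 0.0;
  for (std::size_t j = 0; j < a.extent1(); ++j) sum += a(i, j);
  return sum;
}
```

Both kernels give the same logical results for either layout; only the order of elements in memory changes. In this sketch the layout is selected by the caller through the template argument, whereas the abstract describes the mapping as being introduced per device when the kernel is compiled.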