Kokkos: Enabling manycore performance portability through polymorphic memory access patterns

Abstract: The manycore revolution can be characterized by increasing thread counts, decreasing memory per thread, and diversity of continually evolving manycore architectures. High performance computing (HPC) applications and libraries must exploit increasingly fine levels of parallelism within their codes to sustain scalability on these devices. A major obstacle to performance portability is the diverse and conflicting set of constraints on memory access patterns across devices. Contemporary portable programming models (e.g., OpenMP, OpenACC, OpenCL) address manycore parallelism but fail to address memory access patterns. The Kokkos C++ library enables applications and domain libraries to achieve performance portability on diverse manycore architectures by unifying abstractions for both fine-grain data parallelism and memory access patterns. In this paper we describe Kokkos’ abstractions, summarize its application programmer interface (API), present performance results for unit-test kernels and mini-applications, and outline an incremental strategy for migrating legacy C++ codes to Kokkos. The Kokkos library is under active research and development to incorporate capabilities from new generations of manycore architectures, and to address a growing list of applications and domain libraries.
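To make the idea of a polymorphic memory access pattern concrete, the following is a minimal sketch in plain C++ rather than the Kokkos API. The names `RowMajor`, `ColMajor`, `Array2D`, `fill_pattern`, and `row_sum` are hypothetical and chosen only for illustration. The sketch assumes a two-dimensional array whose index-to-offset mapping is a compile-time template parameter, so that the same kernel source can run over either memory layout.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical layout tags (illustrative only, not Kokkos types).
// Each maps a logical index (i, j) to an offset in a flat buffer.
struct RowMajor {
    static std::size_t offset(std::size_t i, std::size_t j,
                              std::size_t /*n0*/, std::size_t n1) {
        return i * n1 + j;  // consecutive j are adjacent in memory
    }
};

struct ColMajor {
    static std::size_t offset(std::size_t i, std::size_t j,
                              std::size_t n0, std::size_t /*n1*/) {
        return i + j * n0;  // consecutive i are adjacent in memory
    }
};

// A 2D array whose memory layout is selected at compile time.
template <class T, class Layout>
class Array2D {
public:
    Array2D(std::size_t n0, std::size_t n1)
        : n0_(n0), n1_(n1), data_(n0 * n1) {}

    T& operator()(std::size_t i, std::size_t j) {
        assert(i < n0_ && j < n1_);
        return data_[Layout::offset(i, j, n0_, n1_)];
    }
    const T& operator()(std::size_t i, std::size_t j) const {
        assert(i < n0_ && j < n1_);
        return data_[Layout::offset(i, j, n0_, n1_)];
    }

    std::size_t extent0() const { return n0_; }
    std::size_t extent1() const { return n1_; }
    const T* data() const { return data_.data(); }

private:
    std::size_t n0_, n1_;
    std::vector<T> data_;
};

// Kernels are written once against logical indices; they never
// mention the layout, so they work unchanged for either tag.
template <class ArrayType>
void fill_pattern(ArrayType& a) {
    for (std::size_t i = 0; i < a.extent0(); ++i)
        for (std::size_t j = 0; j < a.extent1(); ++j)
            a(i, j) = static_cast<double>(10 * i + j);
}

template <class ArrayType>
double row_sum(const ArrayType& a, std::size_t i) {
    double s = 0.0;
    for (std::size_t j = 0; j < a.extent1(); ++j) s += a(i, j);
    return s;
}
```

Instantiating `Array2D<double, RowMajor>` and `Array2D<double, ColMajor>` and calling `fill_pattern` and `row_sum` on each yields the same logical results, while the underlying buffers are ordered differently. Under this sketch's assumptions, retargeting a kernel to a device that prefers the other access pattern would mean changing one template argument, not the kernel body.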
