Memory access patterns: the missing piece of the multi-GPU puzzle

As multi-GPU nodes become increasingly common in modern HPC clusters, programming paradigms that use them efficiently are needed. To take advantage of the local GPUs and the low-latency, high-throughput interconnects that link them, programmers must meticulously adapt parallel applications with respect to load balancing, boundary conditions, and device synchronization. This paper presents MAPS-Multi, an automatic multi-GPU partitioning framework that distributes the workload based on the underlying memory access patterns. The framework consists of host- and device-level APIs that allow programs to run efficiently on a variety of GPU and multi-GPU architectures. It implements several layers of code optimization, device abstraction, and automatic inference of inter-GPU memory exchanges. The paper demonstrates that MAPS-Multi achieves near-linear scaling on fundamental computational operations, as well as on real-world applications in deep learning and multivariate analysis.
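To make the idea of pattern-based partitioning concrete, the following is a minimal sketch in Python, not the MAPS-Multi API. It assumes a one-dimensional "window" access pattern, in which each output element reads its input neighbors within a fixed radius, and shows how the owned ranges, the ranges each device must read, and the resulting inter-device copies could be derived from that pattern alone. The names `Segment`, `partition_window`, and `infer_exchanges` are hypothetical and chosen for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    device: int
    own: range    # output elements this device computes
    needs: range  # input elements it must read (owned range plus halo)


def partition_window(n, devices, radius):
    """Split n elements as evenly as possible across devices.

    Each device's read range is its owned range widened by `radius`
    on both sides and clamped to [0, n).
    """
    base, extra = divmod(n, devices)
    segments, start = [], 0
    for d in range(devices):
        size = base + (1 if d < extra else 0)
        own = range(start, start + size)
        needs = range(max(0, own.start - radius), min(n, own.stop + radius))
        segments.append(Segment(d, own, needs))
        start += size
    return segments


def infer_exchanges(segments):
    """Return (src, dst, range) copies implied by the access pattern.

    A copy is needed wherever a device's read range overlaps elements
    owned by a different device.
    """
    copies = []
    for dst in segments:
        for src in segments:
            if src.device == dst.device:
                continue
            lo = max(dst.needs.start, src.own.start)
            hi = min(dst.needs.stop, src.own.stop)
            if lo < hi:
                copies.append((src.device, dst.device, range(lo, hi)))
    return copies


if __name__ == "__main__":
    segs = partition_window(n=10, devices=2, radius=1)
    for src, dst, elems in infer_exchanges(segs):
        print(f"device {src} -> device {dst}: elements {list(elems)}")
```

In this sketch the programmer only states the pattern (a window of a given radius); the boundary handling and the list of halo copies follow from it, which is the kind of inference the abstract attributes to the framework.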
