Provably Efficient GPU Algorithms

In this paper we present an abstract model for algorithm design on GPUs that extends the parallel external memory (PEM) model with computations in internal memory (commonly called shared memory in the GPU literature), defined in the presence of memory banks and bank conflicts. We also present a framework for designing bank-conflict-free algorithms on GPUs. Using this framework, we develop the first shared-memory sorting algorithm that incurs no bank conflicts. Our sorting algorithm can serve as a subroutine in comparison-based GPU sorting algorithms, replacing the sorting networks currently used in shared memory. We show experimentally that this substitution improves the runtime of the mergesort implementation in the Thrust library.
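To make the notion of a bank conflict concrete, the following is a minimal sketch and not the model or sorting algorithm of the paper. It assumes 32 banks, each one word wide, with word address a mapped to bank a mod 32, and it assumes that threads reading the same address do not conflict. The names conflict_degree and column_access are hypothetical helpers introduced only for this illustration.

```python
from collections import defaultdict

NUM_BANKS = 32  # assumption: 32 banks, each one word wide


def conflict_degree(addresses, num_banks=NUM_BANKS):
    """Largest number of distinct word addresses that fall into one bank.

    A result of 1 means the access is bank conflict free; a result of k
    means the access is serialized into k rounds under the assumed model.
    """
    banks = defaultdict(set)
    for addr in addresses:
        banks[addr % num_banks].add(addr)
    return max(len(words) for words in banks.values())


def column_access(row_width, column, threads=32):
    """Word addresses touched when thread t reads element (t, column)
    of a row-major 2D array whose rows are row_width words long."""
    return [t * row_width + column for t in range(threads)]


# Rows of 32 words: every thread hits bank 0, a 32-way conflict.
print(conflict_degree(column_access(32, 0)))

# Rows padded to 33 words: each thread hits a different bank.
print(conflict_degree(column_access(33, 0)))
```

Under these assumptions the first call reports a degree of 32 and the second a degree of 1, which is why padding a shared-memory array by one word per row is a common way to avoid conflicts on column accesses.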
