An Efficient Multiway Mergesort for GPU Architectures

Sorting is a primitive operation that serves as a building block for countless algorithms. As such, it is important to design sorting algorithms that approach peak performance on a range of hardware architectures. Graphics Processing Units (GPUs) are particularly attractive architectures because they provide massive parallelism and computing power. However, the intricacies of their compute and memory hierarchies make designing GPU-efficient algorithms challenging. In this work we present GPU Multiway Mergesort (MMS), a new GPU-efficient multiway mergesort algorithm. MMS employs a new partitioning technique that exposes the parallelism needed by modern GPU architectures. To the best of our knowledge, MMS is the first sorting algorithm for the GPU that is asymptotically optimal in terms of global memory accesses and that is completely free of shared memory bank conflicts. We realize an initial implementation of MMS, evaluate its performance on three modern GPU architectures, and compare it to competitive implementations available in state-of-the-art GPU libraries. Although these implementations are highly optimized, MMS compares favorably, achieving performance improvements for most random inputs. Furthermore, unlike MMS, state-of-the-art algorithms are susceptible to bank conflicts. We find that for certain inputs that cause these algorithms to incur large numbers of bank conflicts, MMS can achieve a 33.7% performance improvement over its fastest competitor. Overall, even though its current implementation is not fully optimized, MMS outperforms the fastest comparison-based sorting implementations available to date, owing to its efficient use of the memory hierarchy.
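For readers unfamiliar with the term, the general idea of a multiway mergesort can be sketched sequentially as follows. This is only an illustration of the generic technique, not the MMS algorithm itself: it does not model the partitioning technique, global memory access pattern, or bank-conflict-free shared memory behavior described above. The function name and the parameters `k` (merge fan-in) and `base` (cutoff for the base case) are hypothetical choices made for this sketch.

```python
import heapq


def multiway_mergesort(items, k=4, base=8):
    """Sequential sketch of a k-way mergesort (illustrative only, not MMS).

    k    -- hypothetical fan-in: number of runs merged at each level
    base -- hypothetical cutoff below which the input is sorted directly
    """
    if k < 2:
        raise ValueError("k must be at least 2")
    n = len(items)
    # Base case: small inputs are sorted directly.
    if n <= max(base, 1):
        return sorted(items)
    # Split into at most k contiguous runs of roughly equal size.
    step = -(-n // k)  # ceil(n / k)
    runs = [
        multiway_mergesort(items[i:i + step], k, base)
        for i in range(0, n, step)
    ]
    # Merge all sorted runs in a single k-way pass.
    return list(heapq.merge(*runs))
```

With a fan-in of `k`, the recursion has about log_k(n) levels rather than log_2(n), which is the usual reason multiway merging reduces the number of passes over the data in memory-hierarchy-aware sorting.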
