Analysis-driven Engineering of Comparison-based Sorting Algorithms on GPUs

We study the relationship between memory accesses, bank conflicts, thread multiplicity (also known as over-subscription), and instruction-level parallelism in comparison-based sorting algorithms for Graphics Processing Units (GPUs). We propose a formula that relates these parameters to an asymptotic analysis of the number of memory accesses performed by an algorithm, and we validate it experimentally. Using this formula, we analyze and compare several GPU sorting algorithms and identify the key performance bottlenecks in each. Based on this analysis, we propose a GPU-efficient multiway merge-sort algorithm, GPU-MMS, which minimizes or eliminates these bottlenecks and balances the various limiting factors for specific hardware. We implement GPU-MMS and compare it to the sorting implementations in state-of-the-art GPU libraries on three GPU architectures. Although these library implementations are highly optimized, GPU-MMS outperforms them by an average of 21% for random integer inputs and 14% for random key-value pairs.
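For readers unfamiliar with the term, the general idea of a multiway merge sort can be sketched sequentially as follows. This is only an illustration of the textbook technique on a CPU, not GPU-MMS itself: the function name `multiway_merge_sort` and the parameters `k` (number of runs merged at once) and `base` (size below which a run is sorted directly) are assumptions made for this sketch, and nothing here models bank conflicts, thread multiplicity, or instruction-level parallelism.

```python
import heapq


def multiway_merge_sort(items, k=4, base=8):
    """Sequential sketch of a k-way merge sort (hypothetical helper, not GPU-MMS).

    The input is split into at most k runs, each run is sorted recursively,
    and the sorted runs are combined with a single k-way merge.
    """
    if k < 2:
        raise ValueError("k must be at least 2")
    n = len(items)
    # Small inputs are sorted directly; max(base, 1) guarantees termination.
    if n <= max(base, 1):
        return sorted(items)
    # Ceiling division so that at most k runs are produced.
    step = -(-n // k)
    runs = [
        multiway_merge_sort(items[i:i + step], k, base)
        for i in range(0, n, step)
    ]
    # heapq.merge performs a lazy k-way merge of already sorted iterables.
    return list(heapq.merge(*runs))
```

Merging `k` runs at a time reduces the number of merge passes over the data to roughly the base-`k` logarithm of the number of initial runs, compared with the base-2 logarithm for a pairwise merge sort. That reduction in passes is the usual motivation for multiway merging when memory accesses dominate the cost.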
