Parallel Shellsort Algorithm for Many-Core GPUs with CUDA

Sorting is a classic algorithmic problem, and its importance has led to the design and implementation of various sorting algorithms on many-core graphics processing units (GPUs). CUDPP radix sort is the most efficient sorting algorithm on GPUs, and GPU sample sort is the best comparison-based sort. Although these implementations are efficient, they need either extra space for data rearrangement or atomic operations for acceleration. Sorting applications usually deal with large amounts of data, so memory utilization is an important consideration. Furthermore, on GPUs without atomic operation support, these sorting algorithms can suffer performance degradation or fail to work. In this paper, an efficient implementation of a parallel shellsort algorithm, CUDA shellsort, is proposed for many-core GPUs with CUDA. Experimental results show that, on average, CUDA shellsort is nearly twice as fast as GPU quicksort and 37% faster than Thrust mergesort under a uniform distribution. Moreover, its performance matches that of GPU sample sort for up to 32 million data elements, while requiring only constant extra space. CUDA shellsort is also robust across various data distributions and could be suitable for other many-core architectures.
