Can GPUs sort strings efficiently?

String sorting, or variable-length key sorting, has lagged in performance on the GPU even as fixed-length key sorting has improved dramatically. Radix sort is the fastest sorting method on GPUs. In this paper, we present a fast and efficient string sort on the GPU that is built on the available radix sort. Our method sorts strings from left to right in steps, moving only indexes and small prefixes for efficiency. We reduce the number of sort steps by adaptively consuming the maximum number of string bytes in each step, based on the number of segments. Performance is improved by using Thrust primitives for most steps and by removing singleton segments from consideration. Over 70% of the string sort time is spent in Thrust primitives, which provides high performance along with high adaptability to future GPUs. We achieve a speedup of up to 10x over current GPU methods, especially on large datasets, and we scale to much larger input sizes. We present results on easy and difficult strings, defined using their after-sort tie lengths.
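
The stepwise scheme described above can be illustrated with a small CPU sketch in Python. It is not the GPU implementation: Python's built-in sort stands in for the fixed-length radix sort, and the 8-byte key budget (`KEY_BYTES`), the byte-granular segment-ID packing, and the names `bytes_per_step` and `sort_strings` are assumptions made for illustration only. The sketch shows the three ideas from the abstract: only indexes and short prefixes are moved, the number of string bytes consumed per step shrinks as the number of segments grows, and singleton segments are dropped from later steps.

```python
KEY_BYTES = 8  # assumed key budget of the underlying fixed-length sort


def bytes_per_step(num_segments):
    """String bytes that fit in the key after reserving room for the segment ID."""
    seg_bits = max(num_segments - 1, 0).bit_length()
    seg_bytes = (seg_bits + 7) // 8  # assumption: segment ID is packed in whole bytes
    return max(KEY_BYTES - seg_bytes, 1)


def sort_strings(strings):
    """Return a list of indexes that orders `strings` (bytes objects) lexicographically."""
    n = len(strings)
    order = list(range(n))  # permutation refined step by step
    # Active segments: half-open ranges [lo, hi) of `order` whose strings tie so far.
    segments = [(0, n)] if n > 1 else []
    offset = 0  # number of leading bytes already consumed
    while segments:
        k = bytes_per_step(len(segments))
        # Build (segment ID, k-byte prefix, index) entries for active strings only.
        entries = []
        for seg_id, (lo, hi) in enumerate(segments):
            for pos in range(lo, hi):
                i = order[pos]
                entries.append((seg_id, strings[i][offset:offset + k], i))
        # One sort over fixed-size keys; stands in for the GPU radix sort.
        entries.sort(key=lambda e: (e[0], e[1]))

        next_segments = []
        cursor = 0  # start of the current segment inside `entries`
        for lo, hi in segments:
            size = hi - lo
            for j in range(size):
                order[lo + j] = entries[cursor + j][2]
            # Split the segment wherever the prefix changes.
            start = 0
            for j in range(1, size + 1):
                if j == size or entries[cursor + j][1] != entries[cursor + start][1]:
                    prefix = entries[cursor + start][1]
                    # Keep only ties of two or more strings that still have bytes left;
                    # singletons and fully consumed (identical) strings are finished.
                    if j - start > 1 and len(prefix) == k:
                        next_segments.append((lo + start, lo + j))
                    start = j
            cursor += size
        segments = next_segments
        offset += k
    return order


words = [b"international", b"internet", b"interval", b"internationalize"]
order = sort_strings(words)
sorted_words = [words[i] for i in order]
```

In this sketch the strings themselves are never moved; each step touches only the index array and the short prefixes of strings that are still tied, which is the property the method relies on for efficiency.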
