Efficient String Sort with Multi-Character Encoding and Adaptive Sampling

Sorting plays a fundamental role in computer science, with far-reaching applications in database operations and data science tasks. Strings are an important class of sorting keys, and among string sorting methods, radix sort is a simple but effective algorithm. Many studies have sought to accelerate radix string sort. One typical approach is to process multiple characters in each sorting pass. However, this approach makes the radix grow too large. To address the problem, we introduce a novel multi-character encoding method that significantly reduces the radix. The encoding scheme exploits both the sparse usage of the alphabet space and the sparsity of distinguishing prefixes of the inputs, both of which are common in real-world datasets. By combining this encoding scheme with an adaptive sampling process that generates the encoding efficiently, our sorting algorithm essentially blends radix sort with sample sort and achieves substantial improvement over other sorting approaches. Results on both real and synthetic datasets show that our method yields an average 4.85× performance improvement over C++ STL sort [21], 1.47× over the state-of-the-art radix sort implementation for strings [19], and 2.55× over multikey quicksort [6]. Preliminary tests in a multi-core environment also show that it is competitive with or better than the most recent parallel string sorting algorithm, pS5 [8], which demonstrates the scalability of our method.
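To make the radix problem concrete: consuming two bytes per pass naively requires 256^2 = 65,536 buckets, most of which stay empty on real inputs. The sketch below illustrates the general idea of replacing that raw radix with a small, dense code space derived from a sample. It is a simplified illustration, not the encoding or sampling procedure of the paper; the names `pick_splitters` and `encoded_sort` and the parameters `k`, `sample_size` and `threshold` are assumptions made for this example.

```python
import bisect
import random


def pick_splitters(strings, depth, k, sample_size, rng):
    """Return the sorted distinct k-character chunks seen in a random sample."""
    sample = rng.sample(strings, min(sample_size, len(strings)))
    return sorted({s[depth:depth + k] for s in sample})


def encoded_sort(strings, depth=0, k=2, sample_size=32, threshold=16, rng=None):
    """Sort a list of strings that all share their first `depth` characters."""
    rng = rng or random.Random(0)
    if len(strings) <= threshold:
        return sorted(strings)  # small input: fall back to a comparison sort

    splitters = pick_splitters(strings, depth, k, sample_size, rng)
    # Dense code space instead of 256**k raw buckets:
    #   bucket 2*i + 1 holds chunks equal to splitters[i],
    #   bucket 2*i holds chunks that sort strictly before splitters[i]
    #   (the last even bucket holds chunks after every splitter).
    buckets = [[] for _ in range(2 * len(splitters) + 1)]
    for s in strings:
        chunk = s[depth:depth + k]
        i = bisect.bisect_left(splitters, chunk)
        equal = i < len(splitters) and splitters[i] == chunk
        buckets[2 * i + 1 if equal else 2 * i].append(s)

    out = []
    for code, bucket in enumerate(buckets):
        if not bucket:
            continue
        if code % 2 == 1:
            if len(splitters[code // 2]) < k:
                # The strings end inside this chunk, so they are all equal.
                out.extend(bucket)
            else:
                # Same k characters: continue with the next k characters.
                out.extend(encoded_sort(bucket, depth + k, k,
                                        sample_size, threshold, rng))
        else:
            # Chunks not seen in the sample: refine at the same depth.
            # This bucket is strictly smaller than the input because every
            # splitter comes from an input string that lands elsewhere.
            out.extend(encoded_sort(bucket, depth, k,
                                    sample_size, threshold, rng))
    return out
```

With at most `sample_size` splitters, each pass uses at most `2 * sample_size + 1` buckets regardless of `k`, which is the kind of radix reduction the abstract describes; a production implementation would replace the binary search and Python lists with a precomputed code table and in-place partitioning.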

[1] Peter Sanders, et al. Engineering Parallel String Sorting, 2014, Algorithmica.

[2] Justin Zobel, et al. Cache-conscious sorting of large sets of strings with dynamic tries, 2004, JEAL.

[3] Julian Shun, et al. Theoretically-Efficient and Practical Parallel In-Place Radix Sorting, 2019, SPAA.

[4] Justin Zobel, et al. Efficient Trie-Based Sorting of Large Sets of Strings, 2003, ACSC.

[5] Peter Sanders, et al. Parallel String Sample Sort, 2013, ESA.

[6] Katsuhiko Kakehi, et al. Merging String Sequences by Longest Common Prefixes, 2008.

[7] Josep-Lluís Larriba-Pey, et al. CC-Radix: a cache conscious sorting based on Radix sort, 2003, Eleventh Euromicro Conference on Parallel, Distributed and Network-Based Processing.

[8] Keith Bostic, et al. Engineering Radix Sort, 1993, Comput. Syst.

[9] Arne Andersson, et al. A new efficient radix sort, 1994, Proceedings 35th Annual Symposium on Foundations of Computer Science.

[10] Naila Rahman, et al. Adapting Radix Sort to the Memory Hierarchy, 2001, JEAL.

[11] Timo Bingmann, et al. Scalable String and Suffix Sorting: Algorithms, Techniques, and Tools, 2018, ArXiv.

[12] Jörg Arndt. Mixed radix numbers, 2011.

[13] Tim Kraska, et al. The Case for a Learned Sorting Algorithm, 2020, SIGMOD Conference.

[14] Parosh Aziz Abdulla. Radix Sort, 2011, Encyclopedia of Parallel Computing.

[15] Gianni Franceschini, et al. Radix Sorting with No Extra Space, 2007, ESA.

[16] Anthony Wirth, et al. Engineering burstsort: Toward fast in-place string sorting, 2010, JEAL.

[17] Katsuhiko Kakehi, et al. Cache Efficient Radix Sort for String Sorting, 2007, IEICE Trans. Fundam. Electron. Commun. Comput. Sci.

[18] Justin Zobel, et al. Using random sampling to build approximate tries for efficient string sorting, 2004, JEAL.

[19] Daniel Brand, et al. PARADIS: An Efficient Parallel Algorithm for In-place Radix Sort, 2015, Proc. VLDB Endow.

[20] Juha Kärkkäinen, et al. Engineering Radix Sort for Strings, 2008, SPIRE.

[21] Robert E. Tarjan, et al. Three Partition Refinement Algorithms, 1987, SIAM J. Comput.

[22] Peter Sanders, et al. Engineering a Multi-core Radix Sort, 2011, Euro-Par.

[23] Guy E. Blelloch, et al. Algorithmic Building Blocks for Asymmetric Memories, 2018, ESA.

[24] David R. Musser, et al. Introspective Sorting and Selection Algorithms, 1997, Softw. Pract. Exp.

[25] Robert Sedgewick, et al. Fast algorithms for sorting and searching strings, 1997, SODA '97.

[26] Stephen J. Smith, et al. An improved supercomputer sorting benchmark, 1992, Proceedings Supercomputing '92.

[27] Sebastian Winkel, et al. Super Scalar Sample Sort, 2004, ESA.

[28] Arne Andersson, et al. Implementing radixsort, 1998, JEAL.