Fast and Robust Vectorized In-Place Sorting of Primitive Types

Modern CPUs provide single instruction, multiple data (SIMD) instructions, which process several elements of a primitive data type simultaneously in fixed-size vectors. Classical sorting algorithms are not directly expressible in SIMD instructions, so accelerating sorting algorithms with SIMD instructions is a creative endeavor. A promising approach for sorting with SIMD instructions is to use sorting networks for small arrays and Quicksort for large arrays. In this paper we improve vectorization techniques for sorting networks and Quicksort. In particular, we show how to use the full capacity of vector registers in sorting networks and how to make vectorized Quicksort robust with respect to different key distributions. To demonstrate the performance of our techniques, we implement an in-place hybrid sorting algorithm for the data type int with AVX2 intrinsics. Our implementation is at least 30% faster than state-of-the-art high-performance sorting alternatives.

2012 ACM Subject Classification: Theory of computation → Sorting and searching
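The hybrid structure named in the abstract (a sorting network for small arrays, Quicksort for large ones) can be illustrated with a minimal scalar sketch in C++. This is not the paper's implementation: it uses no AVX2 intrinsics, the cutoff `kNetworkThreshold`, the odd-even transposition network, the median-of-three pivot, and the three-way partition are all assumptions chosen for brevity, and every function name is hypothetical. The three-way partition is only a simple stand-in for robustness against duplicate-heavy inputs, not the technique the paper proposes.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical cutoff below which the sorting network is used.
constexpr std::size_t kNetworkThreshold = 16;

// Compare-exchange: afterwards a <= b. Expressed with min/max so the
// sequence of operations does not depend on the data; a SIMD version
// would apply the same step to whole vectors with vector min/max.
inline void compare_exchange(int& a, int& b) {
    const int lo = std::min(a, b);
    const int hi = std::max(a, b);
    a = lo;
    b = hi;
}

// Odd-even transposition network: n rounds of data-independent
// compare-exchanges on neighbouring pairs sort n elements.
void network_sort(int* a, std::size_t n) {
    assert(n <= kNetworkThreshold);
    for (std::size_t round = 0; round < n; ++round)
        for (std::size_t i = round % 2; i + 1 < n; i += 2)
            compare_exchange(a[i], a[i + 1]);
}

// Median of three values, used as the pivot.
inline int median3(int x, int y, int z) {
    return std::max(std::min(x, y), std::min(std::max(x, y), z));
}

// In-place hybrid sort: Quicksort until the range is small,
// then finish with the sorting network.
void hybrid_sort(int* a, std::size_t n) {
    while (n > kNetworkThreshold) {
        const int pivot = median3(a[0], a[n / 2], a[n - 1]);
        // Three-way partition: [< pivot | == pivot | > pivot].
        int* eq_begin = std::partition(a, a + n,
                                       [pivot](int x) { return x < pivot; });
        int* eq_end = std::partition(eq_begin, a + n,
                                     [pivot](int x) { return x == pivot; });
        const std::size_t left = static_cast<std::size_t>(eq_begin - a);
        const std::size_t right = static_cast<std::size_t>((a + n) - eq_end);
        // Recurse into the smaller side, loop on the larger one,
        // which bounds the recursion depth logarithmically.
        if (left < right) {
            hybrid_sort(a, left);
            a = eq_end;
            n = right;
        } else {
            hybrid_sort(eq_end, right);
            n = left;
        }
    }
    network_sort(a, n);
}
```

Calling `hybrid_sort(v.data(), v.size())` on a `std::vector<int>` sorts it in place. Because the pivot is taken from the array, the middle group is never empty, so each partitioning step shrinks the remaining range even when all keys are equal.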
