Super Scalar Sample Sort

Sample sort, a generalization of quicksort that partitions the input into many pieces, is regarded as the best practical comparison-based sorting algorithm for distributed-memory parallel computers. We show that sample sort is also useful on a single processor. The main algorithmic insight is that element comparisons can be decoupled from expensive conditional branching by using predicated instructions. This transformation facilitates optimizations such as loop unrolling and software pipelining. The final implementation, although cache-efficient, is limited by a linear number of memory accesses rather than by the \(\mathcal{O}\!\left(n\log n\right)\) comparisons. On an Itanium 2 machine, we obtain a speedup of up to 2 over std::sort from the GCC STL, which is known as one of the fastest available quicksort implementations.
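To make the idea of decoupling comparisons from conditional branches concrete, here is a minimal sketch of branch-free bucket classification. It is an illustration under stated assumptions, not the paper's implementation: the names `make_splitter_tree` and `classify` are hypothetical, keys are assumed to be `int`, and the number of buckets k is assumed to be a power of two. The k - 1 sorted splitters are stored as an implicit binary search tree in heap layout, and the comparison result is used as an integer to compute the next tree index instead of steering an `if`.

```cpp
#include <cstddef>
#include <vector>

// Recursively place the median of sorted[lo, hi) at tree[node] (heap layout:
// children of node i live at 2i and 2i + 1). Hypothetical helper.
static void build_tree(const std::vector<int>& sorted, std::vector<int>& tree,
                       std::size_t node, std::size_t lo, std::size_t hi) {
    if (lo >= hi) return;
    const std::size_t mid = lo + (hi - lo) / 2;
    tree[node] = sorted[mid];
    build_tree(sorted, tree, 2 * node, lo, mid);
    build_tree(sorted, tree, 2 * node + 1, mid + 1, hi);
}

// Assumption: sorted holds k - 1 ascending splitters, with k a power of two.
// Returns a 1-indexed tree of size k (slot 0 is unused).
std::vector<int> make_splitter_tree(const std::vector<int>& sorted) {
    std::vector<int> tree(sorted.size() + 1, 0);
    build_tree(sorted, tree, 1, 0, sorted.size());
    return tree;
}

// Returns the bucket index in [0, k) for key x.
// The loop runs exactly log2(k) times regardless of x, so its exit branch is
// easy to predict; the data-dependent comparison only feeds arithmetic.
std::size_t classify(int x, const std::vector<int>& tree) {
    const std::size_t k = tree.size();  // number of buckets
    std::size_t j = 1;                  // start at the root
    while (j < k) {
        // (x > tree[j]) is 0 or 1: go to the left or right child.
        j = 2 * j + static_cast<std::size_t>(x > tree[j]);
    }
    return j - k;  // leaves are numbered k .. 2k - 1
}
```

With splitters {10, 20, 30}, for example, `classify` maps keys up to 10 to bucket 0, keys in (10, 20] to bucket 1, keys in (20, 30] to bucket 2, and larger keys to bucket 3. Whether the comparison actually compiles to a predicated or conditional-move instruction depends on the compiler and target; the source-level form only makes that translation possible, and its fixed trip count is what makes unrolling and software pipelining straightforward.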
