Characterizing the Sort Operation on Multithreaded Architectures

The Sort operation is a core part of many critical applications. Despite considerable efforts to parallelize it, its strong data dependencies severely limit its performance. Multithreaded architectures are emerging as the most sought-after technology in leading-edge processors. These architectures include Simultaneous Multithreading, Chip Multiprocessors, and machines combining different multithreading technologies. In this paper, we analyze the memory behavior and improve the performance of the most recent parallel radix and quick integer sort algorithms on modern multithreaded architectures. On a machine with 4 multithreaded processors, we achieve speedups of up to 4.69x for radix sort and up to 4.17x for quick sort relative to their respective single-threaded versions. We find that because radix sort is CPU-intensive, it performs better on Chip Multiprocessors, where multiple CPUs are available, whereas quick sort achieves speedups on all types of multithreaded processors owing to its ability to overlap memory miss latencies with other useful processing.
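To illustrate the kind of data dependencies referred to above, here is a minimal sequential sketch of a least-significant-digit radix sort for non-negative integers. It is not the implementation evaluated in the paper; the function name `radix_sort` and the parameters `bits_per_pass` and `key_bits` are assumptions chosen for illustration. Within each pass, the scatter step needs the prefix sums, which in turn need the complete histogram, and every pass consumes the output of the previous one.

```python
def radix_sort(keys, bits_per_pass=8, key_bits=32):
    """Illustrative LSD radix sort for integers in [0, 2**key_bits).

    Hypothetical helper for exposition only; names and defaults are
    assumptions, not taken from the paper.
    """
    radix = 1 << bits_per_pass
    mask = radix - 1
    src = list(keys)

    for shift in range(0, key_bits, bits_per_pass):
        # Step 1: histogram of the current digit over all keys.
        counts = [0] * radix
        for k in src:
            counts[(k >> shift) & mask] += 1

        # Step 2: exclusive prefix sum gives each digit's start offset.
        # This depends on the complete histogram from step 1.
        offsets = [0] * radix
        total = 0
        for d in range(radix):
            offsets[d] = total
            total += counts[d]

        # Step 3: stable scatter into the destination buffer.
        # This depends on the offsets from step 2.
        dst = [0] * len(src)
        for k in src:
            d = (k >> shift) & mask
            dst[offsets[d]] = k
            offsets[d] += 1

        # The next pass depends on this pass's output.
        src = dst

    return src
```

A parallel variant would typically have each thread build a histogram over its own slice of the keys and then combine the histograms before scattering; that combination step is one place where threads must synchronize.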
