Parallel Sorting on Cache-coherent DSM Multiprocessors

The performance of parallel sorting is not well understood on hardware cache-coherent shared address space (CC-SAS) multiprocessors, which increasingly dominate the market for tightly-coupled multiprocessing. We study two high-performance parallel sorting algorithms, radix sort and sample sort, under three major programming models (load-store CC-SAS, message passing, and the segmented SHMEM model) on a 64-processor SGI Origin2000. We observe surprisingly good speedups on this demanding application. The performance of radix sort depends strongly on the programming model and on the particular implementation used. Sample sort performs more uniformly across programming models on this platform, but for larger data sets it is usually not as good as the best radix sort when each algorithm is allowed to use its best programming model. The best combination of algorithm and programming model is radix sort under SHMEM for larger data sets and sample sort under CC-SAS for smaller data sets.
