Fast Parallel Sorting under LogP: From Theory to Practice

1.1 ABSTRACT

The LogP model characterizes the performance of modern parallel machines with a small set of parameters: the communication latency (L), overhead (o), bandwidth (g), and the number of processors (P). In this paper, we analyze four parallel sorting algorithms (bitonic, column, radix, and sample sort) under LogP. We develop implementations of these algorithms in a parallel extension to C and compare the actual performance on a CM-5 of 32 to 512 processors with that predicted by LogP using parameter values for this machine. In our experience, the model served as a valuable guide throughout the development of the fast parallel sorts and revealed subtle defects in the implementations. The final observed performance closely matches the prediction across a broad range of problem and machine sizes.

1.2 INTRODUCTION

Fast sorting is important in a wide variety of practical applications, is interesting to study from a theoretical viewpoint, and offers a wealth of novel parallel solutions. The richness of this particular problem arises, in part, because it fundamentally requires communication as well as computation. Thus, sorting is an excellent area in which to investigate the translation from theory to practice of novel parallel algorithms on large parallel systems. In current (1993) technology, "fast parallel sorting" corresponds to a practical performance target of "sorting a billion large keys on a thousand processors

Book title and editor name © 1992 John Wiley & Sons Ltd
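As a rough illustration of how the four LogP parameters named in the abstract combine into a cost estimate, the sketch below computes the time for one small message and for a pipelined stream of messages between two processors. It assumes the standard LogP reading in which g is the minimum gap between consecutive messages at a processor (the reciprocal of per-processor bandwidth). The class name, the function names, and the numeric values are hypothetical and are not the CM-5 parameters used in the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogP:
    """LogP machine parameters, all times in the same unit (e.g. microseconds)."""
    L: float  # network latency for a small message
    o: float  # processor overhead to send or to receive one message
    g: float  # gap: minimum interval between consecutive messages at a processor
    P: int    # number of processors


def one_message_time(m: LogP) -> float:
    """End-to-end time for one small message: send overhead + latency + receive overhead."""
    return m.o + m.L + m.o


def pipelined_send_time(m: LogP, n: int) -> float:
    """Time until the last of n back-to-back messages has been received.

    Successive sends are spaced by max(g, o); the last message then
    takes one full end-to-end message time.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    return (n - 1) * max(m.g, m.o) + one_message_time(m)


# Illustrative placeholder values only (assumption, not measured data).
machine = LogP(L=6, o=2, g=4, P=32)
print(one_message_time(machine))         # single small message
print(pipelined_send_time(machine, 10))  # ten messages sent back to back
```

Estimates of this kind are what allow a predicted running time to be compared against a measured one: when g exceeds o, a long message stream is limited by the gap rather than by the per-message overhead.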
