Fast Parallel Sorting Under LogP: Experience with the CM-5

In this paper, we analyze four parallel sorting algorithms (bitonic, column, radix, and sample sort) with the LogP model. LogP characterizes the performance of modern parallel machines with a small set of parameters: the communication latency (L), the overhead (o), the gap (g, the reciprocal of per-processor communication bandwidth), and the number of processors (P). We develop implementations of these algorithms in Split-C, a parallel extension to C, and compare the performance predicted by LogP to actual performance on a CM-5 of 32 to 512 processors for a range of problem sizes. We evaluate the robustness of the algorithms by varying the distribution and ordering of the key values. We also briefly examine the sensitivity of the algorithms to the communication parameters. We show that the LogP model is a valuable guide in the development of parallel algorithms and a good predictor of implementation performance. The model encourages the use of data layouts that minimize communication and of balanced communication schedules that avoid contention. With an empirical model of local processor performance, LogP predictions closely match observed execution times on uniformly distributed keys across a broad range of problem and machine sizes. We find that communication performance is oblivious to the distribution of the key values, whereas local processor performance is not; some communication phases are, however, sensitive to the ordering of keys because of contention. Finally, our analysis shows that overhead is the most critical communication parameter in the sorting algorithms.
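As a rough illustration of how the four LogP parameters turn into time estimates, the sketch below applies the standard LogP cost expressions for a single short message (o + L + o) and for a pipelined stream of n short messages (o + (n - 1) * max(g, o) + L + o). The class name, function names, and numeric values are assumptions made for illustration only; they are neither the Split-C code nor the CM-5 measurements described in the abstract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogP:
    """LogP machine parameters; time values share one unit (e.g. microseconds)."""
    L: float  # network latency of one short message
    o: float  # processor overhead to send or to receive one message
    g: float  # gap: minimum interval between consecutive sends (1 / bandwidth)
    P: int    # number of processors


def one_message_time(m: LogP) -> float:
    """Send overhead, then network latency, then receive overhead."""
    return m.o + m.L + m.o


def stream_time(m: LogP, n: int) -> float:
    """Time until the last of n pipelined short messages has been received.

    Successive sends are spaced by max(g, o): the sender can issue a new
    message no faster than its overhead or the gap allows.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    return m.o + (n - 1) * max(m.g, m.o) + m.L + m.o


# Hypothetical parameter values, chosen only to make the arithmetic easy to follow.
machine = LogP(L=6.0, o=2.0, g=4.0, P=32)
print(one_message_time(machine))   # 2 + 6 + 2 = 10.0
print(stream_time(machine, 100))   # 2 + 99 * 4 + 6 + 2 = 406.0
```

With these placeholder values the stream is limited by the gap. If o were larger than g, the max(g, o) term would make overhead the per-message cost instead.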
