Practical parallel algorithms for personalized communication and integer sorting

A fundamental challenge for parallel computing is to obtain high-level, architecture-independent algorithms that execute efficiently on general-purpose parallel machines. With the emergence of message-passing standards such as MPI, it has become easier to design efficient and portable parallel algorithms by making use of these communication primitives. While existing primitives provide an assortment of collective communication routines, they do not handle an important communication event in which most or all processors have non-uniformly sized personalized messages to exchange with each other. We focus in this paper on the h-relation personalized communication, whose efficient implementation will allow high-performance implementations of a large class of algorithms. While most previous h-relation algorithms use randomization, this paper presents a new deterministic approach for h-relation personalized communication with asymptotically optimal complexity for h > p^2. As an application, we present an efficient algorithm for stable integer sorting. The algorithms presented in this paper have been coded in Split-C and run on a variety of platforms, including the Thinking Machines CM-5, IBM SP-1 and SP-2, Cray Research T3D, Meiko Scientific CS-2, and the Intel Paragon. Our experimental results are consistent with the theoretical analysis and illustrate the scalability and efficiency of our algorithms across different platforms. In fact, they seem to outperform all similar algorithms known to the authors on these platforms.
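To make the two central terms concrete, the following Python sketch may help. It is not the paper's Split-C code and does not reproduce its deterministic routing scheme. It assumes the usual definition of an h-relation (each processor sends and receives at most h elements) and uses a plain sequential counting sort as one example of a stable integer sort. The names `h_of_relation`, `stable_counting_sort`, and the sample data are hypothetical.

```python
from typing import Any, List, Sequence, Tuple


def h_of_relation(sizes: Sequence[Sequence[int]]) -> int:
    """Return the smallest h for which the pattern is an h-relation.

    Assumption: sizes[i][j] is the number of elements that processor i
    sends to processor j. h is the largest amount any single processor
    has to send or receive.
    """
    p = len(sizes)
    max_sent = max(sum(row) for row in sizes)
    max_received = max(sum(sizes[i][j] for i in range(p)) for j in range(p))
    return max(max_sent, max_received)


def stable_counting_sort(
    items: Sequence[Tuple[int, Any]], num_keys: int
) -> List[Tuple[int, Any]]:
    """Sort (key, payload) pairs by integer key in [0, num_keys).

    Stable: items with equal keys keep their original relative order.
    """
    # counts[k + 1] accumulates how many items carry key k.
    counts = [0] * (num_keys + 1)
    for key, _ in items:
        counts[key + 1] += 1
    # Prefix sums turn counts[k] into the first output slot for key k.
    for k in range(num_keys):
        counts[k + 1] += counts[k]
    out: List[Any] = [None] * len(items)
    # A left-to-right scan places equal keys in input order.
    for item in items:
        out[counts[item[0]]] = item
        counts[item[0]] += 1
    return out


# Hypothetical non-uniform exchange among p = 3 processors.
example_sizes = [
    [0, 3, 1],
    [2, 0, 0],
    [5, 1, 0],
]
example_h = h_of_relation(example_sizes)

example_items = [(2, "a"), (0, "b"), (2, "c"), (1, "d"), (0, "e")]
example_sorted = stable_counting_sort(example_items, num_keys=3)
```

In this example processor 0 receives 2 + 5 = 7 elements, more than any processor sends, so the pattern is a 7-relation. In a distributed setting, a sort of this kind would need each processor to route its elements to the processor that owns their keys, which is where a personalized communication step of non-uniform size arises.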
