A Design of Pipeline Chain Algorithm Based on Circuit Switching for MPI Broadcast Communication System

This paper proposes an algorithm and a hardware architecture for broadcast communication, which is the worst bottleneck in multiprocessors that use distributed-memory architectures. In conventional systems, the pipelined broadcast algorithm takes advantage of the maximum bandwidth of the communication bus. However, because pipelined broadcast divides the data into many parts before sending it, the synchronization process is repeated unnecessarily. In this paper, an MPI unit was designed for a pipeline chain algorithm based on circuit switching, which removes the redundant synchronization, and the proposed architecture was evaluated by modeling it in SystemC. The broadcast performance of the proposed architecture reached up to 3.3 times that of systems using the conventional pipelined broadcast algorithm, and it can use almost the maximum bandwidth of the transmission bus. The design was then implemented in Verilog HDL, synthesized with a TSMC 0.18 µm library, and implemented in a chip. The synthesized design occupies 4,700 gates (2-input NAND gate equivalents), which is 2.4% of the total area. The proposed architecture thus improves the overall performance of an MPSoC while occupying a relatively small area.
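The cost of repeated synchronization can be illustrated with a rough cycle-count model. The sketch below is not the paper's SystemC model. It assumes a linear chain of nodes, a bus that moves one word per cycle, and a fixed handshake cost per synchronization. The function names, the parameters, and the one-cycle hop latency are all hypothetical, chosen only to show why paying one handshake per chunk per hop costs more than paying one handshake per hop.

```python
def pipelined_broadcast_cycles(words, chunks, nodes, sync_cycles):
    """Chunked pipeline (assumed model): every chunk pays one handshake per hop."""
    chunk_words = -(-words // chunks)   # ceiling division: words per chunk
    stages = chunks + nodes - 2         # pipeline fill plus drain over nodes-1 hops
    return stages * (sync_cycles + chunk_words)


def pipeline_chain_cycles(words, nodes, sync_cycles, hop_latency=1):
    """Circuit-switched chain (assumed model): one handshake per hop, then stream."""
    hops = nodes - 1
    setup = hops * sync_cycles          # the circuit is established once along the chain
    return setup + words + (hops - 1) * hop_latency


if __name__ == "__main__":
    # Hypothetical workload: 1024 words, 16 chunks, 8 nodes, 20-cycle handshake.
    conventional = pipelined_broadcast_cycles(1024, 16, 8, 20)
    chained = pipeline_chain_cycles(1024, 8, 20)
    print(f"pipelined broadcast: {conventional} cycles")
    print(f"pipeline chain:      {chained} cycles")
    print(f"ratio:               {conventional / chained:.2f}x")
```

In this model the gap widens as the handshake cost grows relative to the chunk size. The numbers it produces are illustrative only and are not meant to reproduce the 3.3x figure reported above.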
