Fast and Efficient Synchronization and Communication Collective Primitives for Dual Cell-Based Blades

The Cell Broadband Engine (Cell BE) is a heterogeneous multi-core processor specifically designed to exploit thread-level parallelism. Its memory model comprises a common shared main memory and eight small private local memories. Programming the Cell BE involves managing multiple threads and explicit data movement through DMA transfers, which makes the task very challenging. The situation gets even worse when dual Cell-based blades are considered. In this context, fast and efficient collective primitives are indispensable for reducing complexity and optimizing performance. In this paper, we describe the design and implementation of three collective operations: barrier, broadcast, and reduce. Their design takes into account the architectural peculiarities and asymmetries of dual Cell-based blades, while their implementation requires minimal resources: a signal register and a buffer. Experimental results on a dual Cell-based blade with 16 SPEs show low latencies and high bandwidths: a synchronization latency of 637 ns, a broadcast bandwidth of 38.33 GB/s for 16 KB messages, and a reduce latency of 1535 ns with 32 floats.
