Efficiency and scalability of barrier synchronization on NoC based many-core architectures

Interconnects based on Networks-on-Chip are an appealing solution to address future microprocessor designs where, very likely, hundreds of cores will be connected on a single chip. A fundamental role in highly parallelized applications running on many-core architectures will be played by barrier primitives used to synchronize the execution of parallel processes. This paper focuses on the analysis of the efficiency and scalability of different barrier implementations in many-core architectures based on NoCs. Several message passing barrier implementations based on four algorithms (all-to-all, master-slave, butterfly and tree) have been implemented and evaluated for a single-chip target architecture composed of a variable number of cores (from 4 to 128) and different network topologies (mesh, torus, ring, clustered-ring and fat-tree). Using a cycle-accurate simulator, we show the scalability of each barrier for every NoC topology, analyzing and comparing theoretical with real behaviors. We observed that some barrier algorithms, when implemented in hardware or software, show a different scaling behavior with respect to those theoretically expected. We evaluate the efficiency of each combination topology-barrier, demonstrating that, in many cases, simple network topologies can be more efficient than complex and highly connected topologies.

[1]  Dhabaleswar K. Panda,et al.  Efficient and scalable barrier over Quadrics and Myrinet with a new NIC-based collective message passing protocol , 2004, 18th International Parallel and Distributed Processing Symposium, 2004. Proceedings..

[2]  Gianluca Palermo,et al.  Efficient Synchronization for Embedded On-Chip Multiprocessors , 2006, IEEE Transactions on Very Large Scale Integration (VLSI) Systems.

[3]  Luca Benini,et al.  Lightweight barrier-based parallelization support for non-cache-coherent MPSoC platforms , 2007, CASES '07.

[4]  Simon W. Moore,et al.  Low-latency virtual-channel routers for on-chip networks , 2004, Proceedings. 31st Annual International Symposium on Computer Architecture, 2004..

[5]  Fabrizio Petrini,et al.  Hardware- and software-based collective communication on the Quadrics network , 2001, Proceedings IEEE International Symposium on Network Computing and Applications. NCA 2001.

[6]  Mahmut T. Kandemir,et al.  Exploiting barriers to optimize power consumption of CMPs , 2005, 19th IEEE International Parallel and Distributed Processing Symposium.

[7]  William J. Dally,et al.  Route packets, not wires: on-chip inteconnection networks , 2001, DAC '01.

[8]  Fernando Gustavo Tinetti,et al.  Parallel programming: techniques and applications using networked workstations and parallel computers. Barry Wilkinson, C. Michael Allen , 2000 .

[9]  Luca Benini,et al.  Networks on Chips : A New SoC Paradigm , 2022 .

[10]  Gianluca Palermo,et al.  PIRATE: A Framework for Power/Performance Exploration of Network-on-Chip Architectures , 2004, PATMOS.

[11]  Vincenzo Catania,et al.  A methodology for design of application specific deadlock-free routing algorithms for NoC systems , 2006, Proceedings of the 4th International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS '06).

[12]  Keith D. Underwood,et al.  The implications of working set analysis on supercomputing memory hierarchy design , 2005, ICS '05.

[13]  Tobias Bjerregaard,et al.  A survey of research and practices of Network-on-chip , 2006, CSUR.

[14]  Samuel Williams,et al.  The Landscape of Parallel Computing Research: A View from Berkeley , 2006 .