A study of the on-chip interconnection network for the IBM Cyclops64 multi-core architecture

The design of high-performance processor architectures is moving toward integrating a large number of processing cores on a single chip. The IBM Cyclops-64 (C64) is a petaflop supercomputer built on multi-core system-on-a-chip technology. Each C64 chip employs a multistage pipelined crossbar switch as its on-chip interconnection network to provide high-bandwidth, low-latency communication between the 160 thread processing cores, the on-chip SRAM memory banks, and other components. In this paper, we present a simulation-based study of the architecture and performance of the C64 on-chip interconnection network. Our experimental results yield the following observations on the network behavior: (1) Dedicated channels can be created between any output port and any input port of the C64 crossbar with latency as low as 7 cycles. The C64 crossbar has the potential to reach the full hardware bandwidth and exhibits non-blocking behavior; (2) The C64 crossbar is a stable network; (3) The network logic design appears to provide a reasonable opportunity for sharing the channel bandwidth between traffic in either direction; (4) A simple circular neighbor arbitration scheme can achieve a performance level competitive with the more complex segmented LRU (least recently used) matrix arbitration scheme without losing fairness; (5) Application-driven benchmarks provide results comparable to those of synthetic workloads.
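
As an illustration of observation (4), the sketch below shows a generic rotating-priority (round-robin) arbiter, which is one plausible reading of a "circular" arbitration scheme. It is an assumption made for illustration only: the class name `RoundRobinArbiter`, its `grant` method, and the four-port setup are hypothetical and do not describe the actual C64 arbitration logic or the segmented LRU matrix scheme.

```python
from typing import List, Optional


class RoundRobinArbiter:
    """Hypothetical rotating-priority arbiter (not the C64 hardware logic).

    The port immediately after the most recent winner is checked first,
    so every requesting port is eventually served.
    """

    def __init__(self, num_ports: int) -> None:
        self.num_ports = num_ports
        # Start so that port 0 has the highest priority on the first cycle.
        self.last_winner = num_ports - 1

    def grant(self, requests: List[bool]) -> Optional[int]:
        """Return the granted port index, or None if no port is requesting."""
        for offset in range(1, self.num_ports + 1):
            port = (self.last_winner + offset) % self.num_ports
            if requests[port]:
                self.last_winner = port
                return port
        return None


arbiter = RoundRobinArbiter(4)
# Ports 0 and 2 request on every cycle; the grants alternate between them.
grants = [arbiter.grant([True, False, True, False]) for _ in range(4)]
print(grants)
```

The arbiter keeps only one small piece of state (the last winner), which is the kind of simplicity that makes such schemes attractive next to matrix-based LRU arbiters, which track pairwise ordering among requesters.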
