A 98 GMACs/W 32-Core Vector Processor in 65 nm CMOS

This paper presents a high-performance dual-issue 32-core SIMD platform for image and video processing. The SIMD cores support 8/16-bit SIMD MAC instructions and vertical vector access. Eight cores and a 4-port L2 cache are connected by a CIB bus to form a cluster, and four clusters are connected by a mesh network. This hierarchical network provides more than 192 GB/s of low-latency inter-core bandwidth on average. The 4-port L2 cache architecture is likewise designed to provide 192 GB/s of L2 cache bandwidth. To reduce coherence operations in large-scale SMP, an application-specific protocol is proposed. Compared with MOESI, it saves 67.8% of L1 cache energy in the 32-core case. The whole system, including 32 vector cores, 256 KB of L2 cache, a 64-bit DDRII PHY, and two PLL units, occupies 25 mm² in 65 nm CMOS. It achieves a peak performance of 375 GMACs and an efficiency of 98 GMACs/W at 1.2 V.
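The headline figures above imply a few derived quantities, which the following minimal sketch works out. It rests on assumptions the abstract does not state: that the 375 GMACs peak and the 98 GMACs/W efficiency refer to the same operating point, that the 192 GB/s L2 bandwidth is an aggregate over all four clusters, and that work and bandwidth divide evenly across cores, clusters, and ports. The function name `implied_figures` is hypothetical.

```python
# Figures quoted in the abstract.
PEAK_GMACS = 375.0              # peak throughput
EFFICIENCY_GMACS_PER_W = 98.0   # energy efficiency at 1.2 V
NUM_CORES = 32
NUM_CLUSTERS = 4
L2_PORTS_PER_CLUSTER = 4
L2_BW_GB_PER_S = 192.0          # assumed to be the aggregate over all clusters


def implied_figures():
    """Derive back-of-envelope numbers from the quoted figures.

    Assumes peak throughput and efficiency share one operating point,
    and that throughput and bandwidth are split evenly.
    """
    power_w = PEAK_GMACS / EFFICIENCY_GMACS_PER_W
    gmacs_per_core = PEAK_GMACS / NUM_CORES
    l2_bw_per_cluster = L2_BW_GB_PER_S / NUM_CLUSTERS
    l2_bw_per_port = l2_bw_per_cluster / L2_PORTS_PER_CLUSTER
    return {
        "power_w": power_w,
        "gmacs_per_core": gmacs_per_core,
        "l2_bw_per_cluster_gb_s": l2_bw_per_cluster,
        "l2_bw_per_port_gb_s": l2_bw_per_port,
    }


if __name__ == "__main__":
    for name, value in implied_figures().items():
        print(f"{name}: {value:.2f}")
```

Under these assumptions the implied power at peak is a little under 4 W, each core contributes roughly 11.7 GMACs, and each L2 port would carry 12 GB/s. These are estimates derived from the abstract, not measurements reported by the authors.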
