A 280 mV-to-1.1 V 256b Reconfigurable SIMD Vector Permutation Engine With 2-Dimensional Shuffle in 22 nm Tri-Gate CMOS

An ultra-low voltage reconfigurable 4-way to 32-way SIMD vector permutation engine is fabricated in 22 nm tri-gate bulk CMOS, consisting of a 32-entry × 256b 3-read/1-write ported register file with a 256b byte-wise any-to-any permute crossbar for 2-dimensional shuffle. The register file integrates a vertical shuffle across multiple entries into read/write operations, and includes clock-less static reads with shared P/N dual-ended transmission gate (DETG) writes, improving register file VMIN by 250 mV across PVT variations with a wide dynamic operating range of 280 mV to 1.1 V. The permute crossbar implements an interleaved folded byte-wise multiplexer layout forming an any-to-any fully connected tree to perform a horizontal shuffle with permute accumulate circuits, and includes vector flip-flops, stacked min-delay buffers, shared gates, and ultra-low voltage split-output (ULVS) level shifters improving logic VMIN by 150 mV, while enabling peak energy efficiency of 585 GOPS/W measured at 260 mV, 50 °C. The permutation engine achieves: (i) nominal register file performance of 1.8 GHz, 106 mW measured at 0.9 V, 50 °C, (ii) robust register file functionality measured down to 280 mV with peak energy efficiency of 154 GOPS/W, (iii) scalable permute crossbar performance of 2.9 GHz, 69 mW measured at 1.1 V, 50 °C with sub-threshold operation at 240 mV, 10 MHz consuming 19 μW, and (iv) a 64b 4 × 4 matrix transpose algorithm and array-of-structures (AoS) to structure-of-arrays (SoA) conversion with 40%-53% energy savings and 25%-42% improved peak throughput measured at 1.8 GHz, 0.9 V.
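
To make the 2-dimensional shuffle concrete, the following is a minimal software sketch of how a vertical shuffle (each lane read from a different register entry) can be combined with a byte-wise horizontal permute to transpose a 4 × 4 matrix of 64b elements. It is a behavioral model only, not the circuit or the algorithm described in the paper: the function names, the little-endian packing, and the skew/diagonal-read/unskew sequence are all assumptions chosen for illustration.

```python
# Behavioral sketch only; all names and the three-step sequence are hypothetical.
LANES = 4                       # four 64b elements per 256b vector
LANE_BYTES = 8
VEC_BYTES = LANES * LANE_BYTES  # 32 bytes = 256 bits


def pack(values):
    """Pack four 64b integers into one 32-byte vector (little-endian, assumed)."""
    return b"".join(v.to_bytes(LANE_BYTES, "little") for v in values)


def unpack(vec):
    """Inverse of pack(): split a 32-byte vector into four 64b integers."""
    return [int.from_bytes(vec[i:i + LANE_BYTES], "little")
            for i in range(0, VEC_BYTES, LANE_BYTES)]


def horizontal_shuffle(vec, sel):
    """Byte-wise any-to-any permute: output byte i is input byte sel[i]."""
    assert len(vec) == VEC_BYTES and len(sel) == VEC_BYTES
    return bytes(vec[s] for s in sel)


def vertical_shuffle_read(entries, entry_per_lane):
    """Build one vector by reading lane l from entries[entry_per_lane[l]]."""
    out = bytearray()
    for lane, entry in enumerate(entry_per_lane):
        lo = lane * LANE_BYTES
        out += entries[entry][lo:lo + LANE_BYTES]
    return bytes(out)


def rotate_lanes_sel(k):
    """Byte selects such that output lane i takes input lane (i + k) % LANES."""
    return [((i // LANE_BYTES + k) % LANES) * LANE_BYTES + i % LANE_BYTES
            for i in range(VEC_BYTES)]


def transpose_4x4(rows):
    """Transpose four 256b rows of 64b elements using only the two shuffles."""
    # Step 1 (horizontal): skew row i so element (i, j) sits in lane (i + j) % 4.
    skewed = [horizontal_shuffle(r, rotate_lanes_sel(-i))
              for i, r in enumerate(rows)]
    out = []
    for j in range(LANES):
        # Step 2 (vertical): diagonal read, lane l comes from entry (l - j) % 4.
        diag = vertical_shuffle_read(
            skewed, [(l - j) % LANES for l in range(LANES)])
        # Step 3 (horizontal): unskew so lane i holds element (i, j).
        out.append(horizontal_shuffle(diag, rotate_lanes_sel(j)))
    return out


if __name__ == "__main__":
    matrix = [pack([10 * i + j for j in range(LANES)]) for i in range(LANES)]
    for row in transpose_4x4(matrix):
        print(unpack(row))
```

Under the same assumptions, an AoS to SoA conversion of four 4-field records with 64b fields is the same operation: each input row is one record, and each output row collects one field from all four records.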
