A 320 mW 342 GOPS Real-Time Dynamic Object Recognition Processor for HD 720p Video Streams

A heterogeneous multi-core processor is proposed to achieve real-time dynamic object recognition on HD 720p video streams. A context-aware visual attention model is proposed that improves attention accuracy and thereby reduces the computing power required for HD object recognition. To execute the proposed algorithm in real time, the processor adopts a 5-stage task-level pipeline that maximizes the utilization of its 31 heterogeneous cores, which comprise four simultaneous multithreading feature extraction clusters, a cache-based feature matching processor, and a machine learning engine. To increase energy efficiency, dynamic resource management adaptively tunes thread allocation and power management during execution, based on the detected task load and hardware utilization. As a result, the 32 mm² chip, fabricated in 0.13 μm CMOS technology, achieves 30 frames/s with 342 GOPS (8-bit) peak performance and 320 mW average power dissipation, corresponding to a 2.72× performance improvement and a 2.54× per-pixel energy reduction compared to the previous state of the art.
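To put the headline figures in per-pixel terms, the following is a rough back-of-the-envelope sketch rather than the authors' own accounting. It assumes the standard 1280 × 720 frame size for 720p and simply divides the reported average power and peak throughput by the resulting pixel rate; the function names are illustrative, and the paper's exact definition of per-pixel energy may differ.

```python
# Back-of-the-envelope figures derived from the abstract.
# Assumption: "HD 720p" means 1280 x 720 pixels per frame.
WIDTH, HEIGHT = 1280, 720
FPS = 30                 # frames/s, from the abstract
AVG_POWER_W = 0.320      # 320 mW average power, from the abstract
PEAK_OPS = 342e9         # 342 GOPS (8-bit) peak, from the abstract


def pixel_rate(width: int, height: int, fps: int) -> int:
    """Pixels processed per second at the given resolution and frame rate."""
    return width * height * fps


def energy_per_pixel_nj(power_w: float, px_per_s: float) -> float:
    """Average energy spent per input pixel, in nanojoules."""
    return power_w / px_per_s * 1e9


def peak_ops_per_pixel(peak_ops: float, px_per_s: float) -> float:
    """Upper bound on operations available per input pixel at peak throughput."""
    return peak_ops / px_per_s


if __name__ == "__main__":
    rate = pixel_rate(WIDTH, HEIGHT, FPS)
    print(f"pixel rate:        {rate / 1e6:.3f} Mpixel/s")
    print(f"energy per pixel:  {energy_per_pixel_nj(AVG_POWER_W, rate):.2f} nJ")
    print(f"peak ops per pixel: {peak_ops_per_pixel(PEAK_OPS, rate):.0f}")
    # Mixes peak throughput with average power, so treat it as a loose ratio.
    print(f"peak GOPS per watt: {PEAK_OPS / 1e9 / AVG_POWER_W:.2f}")
```

Under these assumptions the chip handles about 27.6 Mpixel/s, spending roughly 11.6 nJ per input pixel, with at most about 12,400 8-bit operations available per pixel at peak throughput.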
