The Perception Processor

Recognizing speech, gestures, and visual features is an important interface capability for future embedded mobile systems. Unfortunately, the real-time performance requirements of complex perception applications cannot be met by current embedded processors and often exceed even the capability of high-performance microprocessors, whose energy budget is in any case infeasible in the embedded space. The usual approach is to resort to a custom ASIC to meet performance and energy constraints. However, ASICs incur expensive and lengthy design cycles, and they are so specialized that they cannot support multiple applications or even evolutionary improvements to a single application. This dissertation introduces a VLIW perception processor that combines clustered function units, compiler-controlled data flow, and compiler-controlled clock gating with hardware support for modulo scheduling, address generation units, and a scratch-pad memory system to achieve very high performance for perceptual algorithms at low energy consumption. The architecture is evaluated using benchmark algorithms taken from complex speech and visual feature recognition, security, and signal processing domains. Since energy and delay are common design trade-offs, the energy-delay product of a CMOS implementation of the perception processor is compared against ASICs and general-purpose processors. Using a combination of Spice simulations, real processor power measurements, and architecture simulation, it is shown that the perception processor running at a 1 GHz clock frequency outperforms a 2.4 GHz Pentium 4 by a factor of 1.75. While delivering this performance, it simultaneously achieves an energy-delay product 159 times better than that of a low-power Intel XScale embedded processor.
The perception processor makes sophisticated real-time perception applications possible within an energy budget that is commensurate with the embedded space, a task that is impossible with current embedded processors.
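
The energy-delay product used as the comparison metric above can be illustrated with a minimal sketch. It assumes the common definition in which energy is average power multiplied by run time, so the product is power times the square of the delay. The function names and the sample power and delay figures are hypothetical, chosen only to show the arithmetic; they are not measurements from the dissertation.

```python
def energy_delay_product(avg_power_w: float, delay_s: float) -> float:
    """Return the energy-delay product in joule-seconds.

    Energy is taken as average power times run time, so the
    product is (P * t) * t.
    """
    energy_j = avg_power_w * delay_s
    return energy_j * delay_s


def edp_improvement(baseline: tuple[float, float],
                    candidate: tuple[float, float]) -> float:
    """Return how many times lower the candidate's EDP is than the baseline's.

    Each argument is an (average power in watts, delay in seconds) pair.
    """
    return energy_delay_product(*baseline) / energy_delay_product(*candidate)


# Hypothetical figures for illustration only (not from the dissertation).
baseline = (2.0, 0.5)    # 2 W for 0.5 s
candidate = (1.0, 0.25)  # 1 W for 0.25 s

print(edp_improvement(baseline, candidate))
```

Because delay enters the product twice, a design that halves both power and run time improves the metric by a factor of eight rather than four, which is why the metric rewards speedups more strongly than energy alone would.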
