Reducing power while increasing performance with SuperCISC

Multiprocessor Systems on Chips (MPSoCs) have become a popular architectural technique for increasing performance. However, MPSoCs may lead to undesirable power consumption characteristics for computing systems that have strict power budgets, such as PDAs, mobile phones, and notebook computers. This paper presents the super-complex instruction-set computing (SuperCISC) embedded processor architecture and, in particular, compares its performance and power consumption with execution on traditional processor architectures. SuperCISC is a heterogeneous, multicore processor architecture designed to exceed the performance of traditional embedded processors while maintaining a reduced power budget compared to low-power embedded processors. At the heart of the SuperCISC processor is a multicore VLIW (Very Long Instruction Word) engine containing several homogeneous execution cores/functional units. In addition, complex and heterogeneous combinational hardware function cores are tightly integrated with the core VLIW engine, providing an opportunity for improved performance and reduced energy consumption. Our SuperCISC processor core has been synthesized for both a 90-nm Stratix II Field Programmable Gate Array (FPGA) and a 160-nm standard cell Application-Specific Integrated Circuit (ASIC) fabrication process from OKI, with the VLIW core operating at approximately 167 MHz in each case. We examine several reasons for the speedup and power improvement achieved by the SuperCISC architecture, including predicated control flow, cycle compression, and a reduction in arithmetic power consumption, which we call power compression. Finally, testing our SuperCISC processor with multimedia and signal-processing benchmarks, we show that it can provide performance improvements ranging from 7X to 160X, with an average of 60X, while also providing orders-of-magnitude power improvements for the computational kernels. The power improvements for our benchmark kernels range from just over 40X to over 400X, with an average savings exceeding 130X. Combining these power and performance improvements, the total energy improvements all exceed 1000X. Because these savings are limited to the computational kernels of the applications, which often consume approximately 90% of the execution time, we expect the whole-application savings to approach the ideal application improvement of 10X.
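The closing arithmetic of the abstract can be checked with a short sketch. It assumes the 10X ceiling follows from an Amdahl's-law style argument (only the roughly 90% kernel portion is accelerated) and that energy improvement is the product of the power and performance improvements (energy = power × time). The function names are illustrative and do not come from the paper, and the kernel pairing in the last line is a hypothetical example, not a reported benchmark result.

```python
def amdahl_speedup(kernel_fraction: float, kernel_speedup: float) -> float:
    """Whole-application speedup when only the kernel portion is accelerated."""
    return 1.0 / ((1.0 - kernel_fraction) + kernel_fraction / kernel_speedup)


def energy_improvement(speedup: float, power_improvement: float) -> float:
    """Energy ratio for one kernel, using energy = power * time."""
    return speedup * power_improvement


KERNEL_FRACTION = 0.90  # "approximately 90%" of execution time, per the abstract

# Kernel speedups quoted in the abstract: minimum, average, maximum.
for label, kernel_speedup in [("min", 7.0), ("avg", 60.0), ("max", 160.0)]:
    app = amdahl_speedup(KERNEL_FRACTION, kernel_speedup)
    print(f"{label}: kernel {kernel_speedup:.0f}X -> application {app:.2f}X")

# Upper bound as the kernel speedup grows without limit.
ideal_limit = 1.0 / (1.0 - KERNEL_FRACTION)
print(f"ideal limit: {ideal_limit:.1f}X")

# Hypothetical kernel (not a result from the paper): 25X faster at 50X lower power.
print(f"hypothetical kernel energy improvement: {energy_improvement(25.0, 50.0):.0f}X")
```

Under these assumptions the quoted kernel speedups translate to whole-application speedups of about 4.4X, 8.7X, and 9.5X, which shows why the improvement approaches but never reaches 10X while 10% of the execution time remains unaccelerated.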
