Low-cost scratchpad memory organizations using heterogeneous cell sizes for low-voltage operations

Abstract: Modern digital signal processors (DSPs) execute diverse applications ranging from digital filters to video decoding. These applications differ drastically in their arithmetic precision and scratchpad memory (SPM) size requirements. To minimize power consumption, DSPs often support aggressive dynamic voltage/frequency scaling (DVFS), which requires on-chip memory such as the SPM to operate at low voltages. However, process variations, which grow with aggressive technology scaling, have significantly increased the failure rate of on-chip memory built from small transistors and operated at low voltages. Consequently, memory cells must use larger and/or more transistors to satisfy a target minimum operating voltage (V_MIN) under a failure rate constraint. Yet using larger and/or more transistors for the SPM, which consumes a large fraction of the chip area, is costly. In this paper, we first propose SPM designs that exploit (i) the characteristics of applications and (ii) the tradeoffs between memory cell size and V_MIN. Our approach can reduce the SPM's chip area by up to 17% and V_MIN by up to 52.5 mV. Second, we exploit the error-tolerant characteristics of some applications. Our proposed SPM can support a lower V_MIN with less mean square error than a conventional SPM with a shortened word width. For error-sensitive applications that require high precision, we can lower V_MIN at the cost of reduced memory capacity. This approach may negatively impact the performance of applications with large memory footprints. However, we demonstrate that such applications are typically constrained by their execution latency requirements and are therefore likely to operate at higher voltages/frequencies than applications with smaller memory footprints in order to satisfy their real-time execution constraints.
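The mean-square-error comparison in the abstract can be illustrated with a small, self-contained sketch. It is not the paper's evaluation: the failure model (each low-order bit flips independently with probability `p`), the uniform data distribution, and the function names `truncation_mse` and `bit_failure_mse` are all assumptions made here for illustration. The sketch contrasts dropping the `k` low-order bits of a word (a shortened word width) with keeping them in cells that occasionally fail.

```python
def truncation_mse(k: int) -> float:
    """MSE when the k low-order bits are dropped and read back as zero.

    Assumes the dropped bits are uniformly distributed (illustrative model).
    """
    n = 1 << k
    # The error equals the value held in the dropped bits.
    return sum(v * v for v in range(n)) / n


def bit_failure_mse(k: int, p: float) -> float:
    """Expected MSE when each of the k low-order bits flips with probability p.

    Hypothetical failure model: flips are independent and the stored data is
    uniform. The result is computed exactly by enumerating every stored value
    and every flip pattern, so no random sampling is involved.
    """
    n = 1 << k
    total = 0.0
    for value in range(n):
        for mask in range(n):
            flips = bin(mask).count("1")
            prob = (p ** flips) * ((1.0 - p) ** (k - flips))
            err = (value ^ mask) - value  # signed read-back error
            total += prob * err * err
    return total / n


if __name__ == "__main__":
    k, p = 4, 0.01  # assumed values, not taken from the paper
    print("truncate low bits :", truncation_mse(k))
    print("failing low bits  :", bit_failure_mse(k, p))
```

Under these assumptions, keeping the low-order bits in cells with a small failure probability yields a much lower expected MSE than discarding them, which is the direction of the tradeoff the abstract describes. The actual gap depends on the real cell failure rates at a given voltage, which this sketch does not model.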
