Characterization and exploitation of narrow-width loads: the narrow-width cache approach

This paper exploits small-value locality to accelerate the execution of memory instructions. We find that narrow-width loads (NWLDs) --- loads with small-value operands of 8 bits or less --- comprise 26% of all executed loads across 40 applications of the SPEC benchmark suites. We establish that the frequency of NWLDs are almost independent of compiler and input data. We introduce narrow-width caches (NWC) to cache small-value memory words. NWCs provide a significant speedup for several memory-intensive applications with a negligible chip-area overhead. NWCs also reduce the overall energy dissipation and memory traffic.

[1]  Kenneth C. Yeager The Mips R10000 superscalar microprocessor , 1996, IEEE Micro.

[2]  Todd M. Austin,et al.  The SimpleScalar tool set, version 2.0 , 1997, CARN.

[3]  Margaret Martonosi,et al.  Dynamically exploiting narrow width operands to improve processor power and performance , 1999, Proceedings Fifth International Symposium on High-Performance Computer Architecture.

[4]  Margaret Martonosi,et al.  Wattch: a framework for architectural-level power analysis and optimizations , 2000, Proceedings of 27th International Symposium on Computer Architecture (IEEE Cat. No.RS00201).

[5]  R. Canal,et al.  Very low power pipelines using significance compression , 2000, Proceedings 33rd Annual IEEE/ACM International Symposium on Microarchitecture. MICRO-33 2000.

[6]  Krste Asanovic,et al.  Dynamic zero compression for cache energy reduction , 2000, MICRO 33.

[7]  Margaret Martonosi,et al.  Value-based clock gating and operation packing: dynamic strategies for improving processor power and performance , 2000, TOCS.

[8]  Mikko H. Lipasti,et al.  Silent Stores and Store Value Locality , 2001, IEEE Trans. Computers.

[9]  Brad Calder,et al.  Automatically characterizing large scale program behavior , 2002, ASPLOS X.

[10]  Jun Yang,et al.  Frequent value locality and its applications , 2002, TECS.

[11]  Jun Yang,et al.  Energy efficient Frequent Value data Cache design , 2002, 35th Annual IEEE/ACM International Symposium on Microarchitecture, 2002. (MICRO-35). Proceedings..

[12]  Gabriel H. Loh Exploiting data-width locality to increase superscalar execution bandwidth , 2002, 35th Annual IEEE/ACM International Symposium on Microarchitecture, 2002. (MICRO-35). Proceedings..

[13]  Glenn Reinman,et al.  Just say no: benefits of early cache miss determination , 2003, The Ninth International Symposium on High-Performance Computer Architecture, 2003. HPCA-9 2003. Proceedings..

[14]  David A. Wood,et al.  Adaptive cache compression for high-performance processors , 2004, Proceedings. 31st Annual International Symposium on Computer Architecture, 2004..

[15]  Kanad Ghose,et al.  Register Packing: Exploiting Narrow-Width Operands for Reducing Register File Pressure , 2004, 37th International Symposium on Microarchitecture (MICRO-37'04).

[16]  Aneesh Aggarwal,et al.  Restrictive compression techniques to increase level 1 cache capacity , 2005, 2005 International Conference on Computer Design.

[17]  Harish Patil,et al.  Pin: building customized program analysis tools with dynamic instrumentation , 2005, PLDI '05.

[18]  Kunle Olukotun,et al.  Niagara: a 32-way multithreaded Sparc processor , 2005, IEEE Micro.

[19]  Mateo Valero,et al.  An asymmetric clustered processor based on value content , 2005, ICS '05.

[20]  M. Ekman,et al.  A robust main-memory compression scheme , 2005, 32nd International Symposium on Computer Architecture (ISCA'05).

[21]  Gilles Pokam,et al.  A case for a complexity-effective, width-partitioned microarchitecture , 2006, TACO.

[22]  Shyamkumar Thoziyoor,et al.  CACTI 5 . 1 , 2008 .

[23]  Per Stenström,et al.  Memory-Link Compression Schemes: A Value Locality Perspective , 2008, IEEE Transactions on Computers.

[24]  Per Stenström,et al.  Zero-Value Caches: Cancelling Loads that Return Zero , 2009, 2009 18th International Conference on Parallel Architectures and Compilation Techniques.