Power-efficient prefetching for embedded processors

Because of stringent power constraints, aggressive latency-hiding techniques such as prefetching are absent from state-of-the-art embedded processors. Two main factors make prefetching power-inefficient. First, compiler-inserted prefetch instructions increase code size and can therefore increase I-cache power. Second, inaccurate prefetching (especially hardware prefetching) wastes D-cache power on useless accesses. In this work, we show that power-efficient prefetching can be supported through bit-differential offset assignment. We target the prefetching of relocatable stack variables with a high degree of precision. By assigning offsets to stack variables so that most consecutively accessed addresses differ by a single bit, we can prefetch them with compact prefetch instructions and save I-cache power. The compiler first builds an access graph of consecutive memory references and then lays the memory locations out in the smallest hypercube that holds them, where each dimension of the hypercube represents a one-bit address difference. The embedding is kept as compact as possible to save memory space. Each load/store instruction carries a hint for prefetching the next memory reference, encoded as that reference's differential address with respect to the current one. To further reduce D-cache power, we also assign offsets so that most consecutive accesses map to the same cache line. Prefetching is performed through a one-entry line buffer [Wilson et al. 1996], so many D-cache look-ups reduce to incremental ones, which cuts D-cache activity and saves power. Our prefetcher requires both compiler and hardware support. In this paper, we describe an implementation on a processor model close to the ARM with small modifications to the ISA, and we handle issues such as out-of-order commit, predication, and speculation through simple changes to the processor pipeline on noncritical paths. Our goal is to boost performance while maintaining or lowering power consumption. Our results show a 12% speedup and a slight power reduction; the runtime virtual-space overhead for stack and static data is about 11.8%.
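To make the offset-assignment idea concrete, the sketch below gives a minimal, illustrative rendering of the bit-differential layout described above; it is not the paper's implementation. It assumes a profiled trace of consecutive stack-variable references, builds the weighted access graph, and greedily places frequently adjacent pairs at offsets that differ in exactly one bit inside the smallest hypercube that fits the variables (exact hypercube embedding is NP-complete, so a heuristic stands in here). Helper names such as `build_access_graph` and `prefetch_hint` are hypothetical.

```python
# Illustrative sketch of bit-differential offset assignment (not the
# authors' implementation). Stack variables that are frequently accessed
# one after another are placed at offsets that differ in exactly one bit,
# so a load/store can hint the next access as "flip bit k of this offset".

def build_access_graph(trace):
    """Weight each unordered pair of variables by how often they are
    referenced consecutively in the (profiled or estimated) trace."""
    graph = {}
    for a, b in zip(trace, trace[1:]):
        if a == b:
            continue
        key = tuple(sorted((a, b)))
        graph[key] = graph.get(key, 0) + 1
    return graph

def hypercube_dim(n):
    """Smallest d with 2**d >= n (smallest hypercube holding n nodes)."""
    d = 0
    while (1 << d) < n:
        d += 1
    return d

def assign_offsets(variables, graph):
    """Greedy embedding: place the heaviest edges first on hypercube
    neighbours (offsets at Hamming distance 1). Not optimal, but it
    illustrates the layout goal."""
    d = hypercube_dim(len(variables))
    free = set(range(1 << d))
    offset = {}
    for (a, b), _w in sorted(graph.items(), key=lambda kv: -kv[1]):
        if a in offset and b in offset:
            continue
        if a not in offset and b not in offset:
            base = min(free)
            offset[a] = base
            free.discard(base)
        placed, unplaced = (a, b) if a in offset else (b, a)
        # Try to put the unplaced endpoint one bit-flip away from the placed one.
        for k in range(d):
            cand = offset[placed] ^ (1 << k)
            if cand in free:
                offset[unplaced] = cand
                free.discard(cand)
                break
        else:                       # no free neighbour: take any remaining slot
            slot = min(free)
            offset[unplaced] = slot
            free.discard(slot)
    for v in variables:             # variables never seen in the trace
        if v not in offset:
            slot = min(free)
            offset[v] = slot
            free.discard(slot)
    return offset

def prefetch_hint(cur, nxt):
    """Encode the next reference as a single bit index when possible."""
    diff = cur ^ nxt
    return diff.bit_length() - 1 if diff and diff & (diff - 1) == 0 else None

if __name__ == "__main__":
    trace = ["a", "b", "a", "c", "b", "c", "a", "b", "d"]
    offs = assign_offsets(["a", "b", "c", "d"], build_access_graph(trace))
    print(offs)                                  # e.g. {'a': 0, 'b': 1, 'c': 2, 'd': 3}
    print(prefetch_hint(offs["a"], offs["b"]))   # single bit index -> compact hint
```

In this reading, a load/store whose `prefetch_hint` is a single bit index can carry that index as a compact encoding of the next reference, while a hint of `None` simply falls back to issuing no prefetch; the actual ISA encoding and line-buffer interaction are as described in the paper, not in this sketch.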

[1] Santosh Pande et al. Storage assignment optimizations through variable coalescence for embedded processors. LCTES '03, 2003.

[2] Norman P. Jouppi et al. An Enhanced Access and Cycle Time Model for On-chip Caches. WRL Research Report 93/5, 1994.

[3] Olivier Temam et al. MicroLib: A Case for the Quantitative Comparison of Micro-Architecture Mechanisms. MICRO-37, 2004.

[4] T. Ozawa et al. Cache miss heuristics and preloading techniques for general-purpose programs. MICRO-28, 1995.

[5] Rainer Leupers et al. Algorithms for address assignment in DSP code generation. ICCAD '96, 1996.

[6] Kurt Keutzer et al. Storage assignment to decrease code size. ACM TOPLAS, 1996.

[7] Todd M. Austin et al. The SimpleScalar tool set, version 2.0. ACM SIGARCH Computer Architecture News, 1997.

[8] Amit Rao et al. Storage assignment optimizations to generate compact and efficient code on embedded DSPs. PLDI '99, 1999.

[9] Kenneth M. Wilson et al. Increasing Cache Port Efficiency for Dynamic Superscalar Microprocessors. ISCA '96, 1996.

[10] E. Witchel et al. Direct addressed caches for reduced power consumption. MICRO-34, 2001.

[11] Trevor Mudge et al. MiBench: A free, commercially representative embedded benchmark suite. IEEE International Workshop on Workload Characterization (WWC-4), 2001.

[12] Alan Jay Smith. Cache Memories. ACM Computing Surveys, 1982.

[13] Sangyeun Cho et al. Decoupling local variable accesses in a wide-issue superscalar processor. ISCA '99, 1999.

[14] Chandra Krintz et al. Cache-conscious data placement. ASPLOS VIII, 1998.

[15] Margaret Martonosi et al. Wattch: a framework for architectural-level power analysis and optimizations. ISCA '00, 2000.

[16] Mikko H. Lipasti et al. Partial resolution in branch target buffers. MICRO-28, 1995.

[17] Todd C. Mowry et al. Compiler-based prefetching for recursive data structures. ASPLOS VII, 1996.

[18] M. Smelyanskiy et al. Stack value file: custom microarchitecture for the stack. HPCA-7, 2001.

[19] Santosh Pande et al. Power-efficient prefetching via bit-differential offset assignment on embedded processors. LCTES '04, 2004.

[20] Gadi Haber et al. Optimization opportunities created by global data reordering. CGO 2003.

[21] Reiner Kolla et al. Spanning tree based state encoding for low power dissipation. DATE '99, 1999.

[22] Nikil D. Dutt et al. Low-power memory mapping through reducing address bus activity. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 1999.

[23] Mahmut T. Kandemir et al. Power protocol: reducing power dissipation on off-chip data buses. MICRO-35, 2002.

[24] Alfred V. Aho et al. Compilers: Principles, Techniques, and Tools. Addison-Wesley, 1986.

[25] Alan Wagner et al. Embedding Trees in a Hypercube is NP-Complete. SIAM Journal on Computing, 1990.

[26] Simon Segars. Low power design techniques for microprocessors. 2000.