Compiler-directed physical address generation for reducing dTLB power

Address translation using the Translation Lookaside Buffer (TLB) consumes as much as 16% of the chip power on some processors because of its high associativity and access frequency. While prior work has looked into optimizing this structure at the circuit and architectural levels, this paper takes a different approach of optimizing its power by reducing the number of data TLB (dTLB) lookups for data references. The main idea is to keep translations in a set of translation registers, and intelligently use them in software to directly generate the physical addresses without going through the dTLB. The software has to work within the confines of the translation registers provided by the hardware, and has to maximize the reuse of such translations to be effective. We propose strategies and code transformations for achieving this in array-based and pointer-based codes, looking to optimize data accesses. Results with a suite of Spec95 array-based and pointer-based codes show dTLB energy savings of up to 73% and 88%, respectively, compared to directly using the dTLB for all references. Despite the small increase in instructions executed with our mechanisms, the approach can in fact provide performance benefits in certain cases.

[1]  Michael Wolfe,et al.  High performance compilers for parallel computing , 1995 .

[2]  Srilatha Manne Low Power TLB Design for High Performance Microprocessors , 1997 .

[3]  Steve Alten,et al.  Omega Project , 1978, Encyclopedia of Parallel Computing.

[4]  Sharad Malik,et al.  Precise miss analysis for program transformations with caches of arbitrary associativity , 1998, ASPLOS VIII.

[5]  Doug Burger,et al.  Evaluating Future Microprocessors: the SimpleScalar Tool Set , 1996 .

[6]  Luca Benini,et al.  Monitoring system activity for OS-directed dynamic power management , 1998, Proceedings. 1998 International Symposium on Low Power Electronics and Design (IEEE Cat. No.98TH8379).

[7]  Norman P. Jouppi,et al.  CACTI 2.0: An Integrated Cache Timing and Power Model , 2002 .

[8]  Monica S. Lam,et al.  The Multiprocessor as a General-Purpose Processor: A Software Perspective , 1996 .

[9]  Tomás Lang,et al.  Reducing TLB power requirements , 1997, Proceedings of 1997 International Symposium on Low Power Electronics and Design.

[10]  Luca Benini,et al.  System-level power estimation and optimization , 1998, Proceedings. 1998 International Symposium on Low Power Electronics and Design (IEEE Cat. No.98TH8379).

[11]  Margaret Martonosi,et al.  Dynamic thermal management for high-performance microprocessors , 2001, Proceedings HPCA Seventh International Symposium on High-Performance Computer Architecture.

[12]  Keshav Pingali,et al.  Data-centric multi-level blocking , 1997, PLDI '97.

[13]  Margaret Martonosi,et al.  Wattch: a framework for architectural-level power analysis and optimizations , 2000, Proceedings of 27th International Symposium on Computer Architecture (IEEE Cat. No.RS00201).

[14]  Seh-Woong Jeong,et al.  A Low Power TLB Structure for Embedded Systems , 2002, IEEE Computer Architecture Letters.