Energy-efficient data prefetch buffering for low-end embedded processors

An energy-efficient architecture should jointly optimize energy consumption and throughput, as captured by the Energy-Delay-Squared Product (ED2P) metric. This paper introduces a prefetch data buffer micro-architecture that achieves this goal by using software-inserted control words to govern the prefetch process. The proposed architecture targets low-end embedded processors, which lack a cache-based memory hierarchy in order to reduce energy consumption. Identifying after compilation which data should be prefetched, and modifying the object code accordingly, reduces the rate of prefetch misses. Pre-computing memory addresses with auxiliary software after compilation, and again modifying the object code, avoids address computation by hardware at run time, which reduces pipeline stalls and thus improves throughput. Additionally, in the case of branches, two data items are prefetched at any one time so that the alternative outcomes are anticipated. The paper reports results from running a range of well-known and representative benchmarks on the proposed architecture. Tested over seven benchmarks, execution times improved by 620% compared with an unbuffered architecture. Furthermore, the average ED2P of the buffered architecture, normalized against the same architecture without buffering, varied between 54% and 90% depending on the benchmark, at the cost of an increase in code size. That is to say, for the benchmarks tested there was a net energy-efficiency improvement of between 10% and 46% in comparison with the equivalent unbuffered architecture, with a lower area overhead.
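
The relationship between the normalized ED2P figures and the quoted efficiency improvement can be illustrated with a minimal sketch. It assumes the conventional definition of ED2P as energy multiplied by the square of delay, and it takes the improvement to be one minus the normalized value. The function names and the sample measurements are hypothetical and are not taken from the paper.

```python
def ed2p(energy_joules: float, delay_seconds: float) -> float:
    """Energy-Delay-Squared Product, E * D^2 (lower is better).

    Assumes the conventional definition of the metric.
    """
    return energy_joules * delay_seconds ** 2


def normalized_ed2p(buffered: float, unbuffered: float) -> float:
    """ED2P of the buffered design relative to the unbuffered baseline."""
    return buffered / unbuffered


def improvement(normalized: float) -> float:
    """Net improvement implied by a normalized ED2P value (1.0 = no change)."""
    return 1.0 - normalized


if __name__ == "__main__":
    # Hypothetical measurements, for illustration only.
    baseline = ed2p(energy_joules=1.0, delay_seconds=1.0)
    buffered = ed2p(energy_joules=0.9, delay_seconds=0.9)
    norm = normalized_ed2p(buffered, baseline)
    print(f"normalized ED2P: {norm:.3f}, improvement: {improvement(norm):.1%}")

    # The bounds quoted in the abstract: 54% and 90% normalized ED2P.
    for quoted in (0.54, 0.90):
        print(f"normalized {quoted:.0%} -> improvement {improvement(quoted):.0%}")
```

Because delay enters the metric squared, a given reduction in execution time lowers ED2P more than the same proportional reduction in energy.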
