Value Reuse Potential in ARM Architectures

Code execution in modern superscalar processors is inherently redundant: many instructions execute repeatedly with the same inputs and produce the same outputs, wasting resources in the process. Value reuse techniques memoize previous executions of instructions, blocks, or traces so that they can be reused when they recur with the same input context. Although trace reuse techniques show great potential for improving both performance and energy consumption, they have not yet been studied in one of the most widely available computer architectures, the ARM architecture. In this paper, we revisit the main issues of reusing traces in instruction sets with conditional execution. We then analyze the reuse potential of the MiBench benchmark suite while varying (i) how traces are generated and (ii) the size of the reuse tables. Our results show that a 32 KiB memoization table allows 18.36% of the total instructions to be reused on average.
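The reuse test described above can be illustrated with a minimal Python sketch of a trace memoization table. It is not the paper's design. It assumes a fully associative table with LRU replacement whose capacity is counted in entries rather than KiB. It also assumes a hypothetical entry layout: the start PC, the input registers and their values, and the NZCV condition flags encoded as a 4-bit value. The flags are included because, with conditional execution, a trace's behaviour can depend on them. The names `TraceEntry`, `ReuseTable` and `reuse` are illustrative only.

```python
from collections import OrderedDict
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

RegValues = Tuple[Tuple[int, int], ...]  # ((register index, value), ...)


@dataclass(frozen=True)
class TraceEntry:
    """One memoized trace (hypothetical layout, not the paper's format)."""
    start_pc: int
    inputs: RegValues          # registers read by the trace before being written
    in_flags: Optional[int]    # NZCV as a 4-bit value; None if no instruction reads them
    outputs: RegValues         # final values of the registers written by the trace
    out_flags: Optional[int]   # NZCV after the trace; None if the trace leaves them unchanged
    next_pc: int               # address where execution resumes after the trace
    length: int                # number of instructions the trace covers


class ReuseTable:
    """Fixed-capacity, fully associative memoization table with LRU replacement."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.entries = OrderedDict()  # (start_pc, inputs, in_flags) -> TraceEntry

    def insert(self, entry: TraceEntry) -> None:
        key = (entry.start_pc, entry.inputs, entry.in_flags)
        self.entries[key] = entry
        self.entries.move_to_end(key)          # mark as most recently used
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)   # evict the least recently used entry

    def lookup(self, pc: int, regs: Dict[int, int], flags: int) -> Optional[TraceEntry]:
        """Reuse test: same start PC, same input register values and, if the
        trace reads the condition flags, the same NZCV value."""
        hit_key = None
        for key, entry in self.entries.items():
            if entry.start_pc != pc:
                continue
            if entry.in_flags is not None and entry.in_flags != flags:
                continue
            if all(regs.get(r) == v for r, v in entry.inputs):
                hit_key = key
                break
        if hit_key is None:
            return None
        self.entries.move_to_end(hit_key)      # a hit refreshes the entry's LRU position
        return self.entries[hit_key]


def reuse(entry: TraceEntry, regs: Dict[int, int], flags: int) -> Tuple[int, int]:
    """Skip executing the trace: write its memoized outputs and return
    (next PC, condition flags after the trace)."""
    for r, v in entry.outputs:
        regs[r] = v
    new_flags = flags if entry.out_flags is None else entry.out_flags
    return entry.next_pc, new_flags


# Usage: a hypothetical two-instruction trace "ADD r3, r1, r2; CMP r3, #12",
# memoized once and later reused when r1 and r2 hold the same values.
table = ReuseTable(capacity=4)
table.insert(TraceEntry(start_pc=0x100, inputs=((1, 5), (2, 7)), in_flags=None,
                        outputs=((3, 12),), out_flags=0b0110, next_pc=0x108, length=2))
regs = {1: 5, 2: 7, 3: 0}
hit = table.lookup(0x100, regs, flags=0)
if hit is not None:
    pc, flags = reuse(hit, regs, flags=0)
```

Keying entries on the flags only when a trace actually reads them means traces without conditional instructions do not need to match on NZCV. A real design would also have to bound the number of inputs and outputs per entry to fit a fixed table size.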
