Decanting the Contribution of Instruction Types and Loop Structures in the Reuse of Traces

Reuse has been proposed as a microarchitecture-level mechanism to reduce the number of executed instructions, collapsing dependencies and freeing resources for other instructions. Previous works have chosen reuse domains, such as memory accesses, integer instructions, or non-floating-point instructions, based on the reusability rate. However, these works have not studied how reusing different subsets of instructions contributes to performance. In this work, we analysed the sensitivity of trace reuse to instruction subsets, comparing the efficiency of each subset to that of its complementary subset. We also studied the amount of reuse that can be extracted from loops. Our experiments show that disabling trace reuse outside loops does not harm performance and reduces the number of accesses to the reuse table by 12%. Our experiments with reuse subsets show that most of the speedup can be retained even when not all instruction types previously included in the reuse domain are reused.
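To make the notions of a reuse domain and of loop-only reuse concrete, the following is a minimal sketch under strong simplifying assumptions. It models reuse of single instructions rather than traces, and every name in it (`ReuseTable`, `domain`, `loops_only`, the `"int_alu"` class label) is hypothetical and chosen only for illustration; it is not the architecture evaluated in this work.

```python
from dataclasses import dataclass, field


@dataclass
class ReuseTable:
    """Toy instruction-level reuse table (illustrative, not the paper's design)."""
    domain: set                      # instruction classes allowed to be reused
    loops_only: bool = False         # if True, reuse is disabled outside loops
    entries: dict = field(default_factory=dict)
    accesses: int = 0
    hits: int = 0

    def execute(self, pc, kind, operands, compute, in_loop):
        # Instructions outside the reuse domain (or outside loops, when
        # loops_only is set) bypass the table and are simply executed.
        if kind not in self.domain or (self.loops_only and not in_loop):
            return compute(*operands)
        self.accesses += 1
        key = (pc, operands)         # same instruction, same input values
        if key in self.entries:
            self.hits += 1
            return self.entries[key] # reuse the stored result
        result = compute(*operands)
        self.entries[key] = result
        return result


# Hypothetical usage: only integer ALU instructions inside loops are reused.
table = ReuseTable(domain={"int_alu"}, loops_only=True)
add = lambda a, b: a + b
table.execute(0x40, "int_alu", (1, 2), add, in_loop=False)  # bypasses the table
for _ in range(3):
    table.execute(0x44, "int_alu", (1, 2), add, in_loop=True)
```

In this toy model, shrinking `domain` or setting `loops_only` lowers the number of table accesses, which is the kind of trade-off between retained speedup and table traffic that the experiments quantify.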
