A Speed-up Technique for an Auto-Memoization Processor by Reusing Partial Results of Instruction Regions

We have proposed an auto-memoization processor based on computation reuse. The auto-memoization processor dynamically detects functions and loop iterations as reusable blocks and memoizes them automatically. In the previous model, computation reuse could not be applied if the current input sequence differed from every past input sequence by even a single value, since the processing results would then differ. This paper proposes a new partial reuse model, which applies computation reuse to the early part of a reusable block as long as the early part of the current input sequence matches one of the past sequences. In addition, to gain sufficient benefit from the partial reuse model, we also propose a technique that reduces the search overhead of the memoization table by partitioning it. Experiments with the SPEC CPU95 benchmark suite show that the new method improves the maximum speedup from 40.6% to 55.1% and the average speedup from 10.6% to 22.8%.
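To make the partial reuse model concrete, the following is a minimal software sketch (all names are hypothetical; the actual mechanism is implemented in hardware). Past input sequences of a reusable block are stored in a trie, and each trie node keeps the partial result reached after consuming that input prefix. A new execution reuses the longest matching prefix and computes only the remaining steps, whereas the previous model would discard the entry on the first mismatch.

```python
# Sketch of prefix-based partial reuse, assuming a reusable block can be
# modeled as repeatedly applying step(state, value) to a sequence of inputs
# starting from the same initial state on every invocation.
class PartialMemoTable:
    def __init__(self, step):
        self.step = step                              # step(state, value) -> new state
        self.root = {"result": None, "next": {}}      # result after the empty prefix

    def execute(self, inputs, initial_state):
        node = self.root
        if node["result"] is None:
            node["result"] = initial_state
        state = node["result"]
        computed = 0                                  # steps actually executed
        for v in inputs:
            child = node["next"].get(v)
            if child is not None:                     # prefix matches a past sequence: reuse
                node = child
                state = node["result"]
            else:                                     # divergence point: compute and record
                state = self.step(state, v)
                computed += 1
                child = {"result": state, "next": {}}
                node["next"][v] = child
                node = child
        return state, computed
```

For example, with `step = lambda s, v: s + v * v`, a first run on inputs `[1, 2, 3]` executes all three steps, while a later run on `[1, 2, 4]` reuses the matching two-value prefix and executes only the final step. The table-partitioning technique mentioned in the abstract addresses a cost this sketch hides: searching one large table for a matching prefix.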
