Paper information - An implementation of Auto-Memoization mechanism on ARM-based superscalar processor


We have proposed the Auto-Memoization Processor, a processor based on computation reuse. So far, we have implemented and studied the auto-memoization mechanism on a single-issue, non-pipelined SPARC processor. The processor dynamically detects functions and loop iterations as reusable blocks and memoizes them automatically, and it can apply computation reuse to these blocks with little reuse overhead. However, favorable evaluation results on such a simple architecture do not by themselves demonstrate practicality, because superscalar architectures are now widely used in general-purpose processors for PCs, in embedded processors, and in various other processors. We therefore examine the problems that arise when implementing the auto-memoization mechanism on an ARM-based superscalar processor, and we design an ARM-based Auto-Memoization Processor. One such problem is that the reuse overhead causes pipeline stalls. To solve it, we implement a mechanism that overlaps the reuse overhead with the pipeline execution of the processor. Evaluation with the SPEC CPU95 benchmark suite shows that the ARM-based Auto-Memoization Processor achieves speed-up, as the previous SPARC-based Auto-Memoization Processor does. In this paper, we describe the implementation and the evaluation results of the ARM-based Auto-Memoization Processor.
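The idea of computation reuse described in the abstract can be illustrated with a small software analogy. The sketch below is not the paper's hardware design: `ReuseTable`, `run`, and `sum_of_squares` are hypothetical names introduced only for illustration, and the table is an ordinary dictionary rather than the processor's memoization hardware. It only shows the general principle that a block whose inputs match a stored entry can skip execution and return the recorded output.

```python
from typing import Callable, Dict, Tuple


class ReuseTable:
    """Hypothetical memo table mapping (block id, input values) to an output."""

    def __init__(self) -> None:
        self.entries: Dict[Tuple[str, Tuple[int, ...]], int] = {}
        self.hits = 0    # executions skipped thanks to reuse
        self.misses = 0  # executions that actually ran the block

    def run(self, block_id: str, block: Callable[..., int], *inputs: int) -> int:
        key = (block_id, inputs)
        if key in self.entries:
            # Same block, same inputs: reuse the stored output.
            self.hits += 1
            return self.entries[key]
        # No matching entry: execute the block and memoize its result.
        self.misses += 1
        result = block(*inputs)
        self.entries[key] = result
        return result


def sum_of_squares(n: int) -> int:
    """Stand-in for a reusable block (a function whose output depends only on n)."""
    return sum(i * i for i in range(n))


table = ReuseTable()
results = [table.run("sum_of_squares", sum_of_squares, n) for n in (10, 20, 10, 10)]
print(results, table.hits, table.misses)
```

In this toy run, the third and fourth calls repeat the input of the first, so they are served from the table instead of being executed again. The lookup itself has a cost, which corresponds loosely to the reuse overhead the abstract mentions.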

Yasuhiko Nakashima | Tomoaki Tsumura | Yuuki Shibata | Takanori Tsumura
