AxMemo: Hardware-Compiler Co-Design for Approximate Code Memoization

Historically, continuous improvements in general-purpose processors have fueled the economic success and growth of the IT industry. However, the diminishing benefits of transistor scaling and conventional optimization techniques necessitate moving beyond common practices. Approximate computing is one such unconventional technique that has shown promise in pushing the boundaries of general-purpose processing. This paper sets out to employ approximation in processors that are commonly used in cyber-physical domains and may become building blocks of the Internet of Things. To this end, we propose AxMemo to exploit the computation redundancy that stems from data similarity in the inputs of code blocks. Such input behavior is prevalent in cyber-physical systems, as they deal with real-world data that naturally harbors redundancy. Therefore, in contrast to existing memoization techniques that replace costly floating-point arithmetic operations with a limited number of inputs, AxMemo focuses on memoizing blocks of code with potentially many inputs. As such, AxMemo aims to replace long sequences of instructions with a few hash and lookup operations. By reducing the number of dynamic instructions, AxMemo alleviates both the von Neumann overhead and the execution overhead of passing instructions through the processor pipeline. The challenge AxMemo faces is to provide a low-cost hashing mechanism that can generate a reasonably unique signature for each multi-input combination. To address this challenge, we develop a novel use of Cyclic Redundancy Check (CRC) codes to hash the inputs. To increase the lookup-table hit rate, AxMemo employs a two-level memoization lookup that utilizes a small dedicated SRAM and spare storage in the last-level cache. These solutions enable AxMemo to efficiently memoize relatively large code regions with variable input sizes and types using the same underlying hardware.
Our experiments show that AxMemo offers a 2.64× speedup and a 2.58× energy reduction with a mere 0.2% quality loss, averaged across ten benchmarks. These benefits come with an area overhead of just 2.1%.
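To illustrate the lookup flow described above, the following is a minimal software sketch of CRC-hashed code-block memoization: the inputs of a block are hashed into one CRC-32 signature, the signature indexes a fixed-size table, and on a tag match the cached output is returned without re-executing the block. This is a toy single-level model, not AxMemo's hardware design; the class and method names (`CRCMemoTable`, `lookup_or_compute`) are illustrative, and real AxMemo uses a second lookup level in the last-level cache.

```python
import zlib

class CRCMemoTable:
    """Toy model of CRC-based memoization: hash a code block's inputs into
    one signature, index a fixed-size table, and return the cached output
    on a tag match instead of re-running the block."""

    def __init__(self, size=4096):
        self.size = size
        self.table = {}  # table index -> (full signature tag, cached output)

    def _signature(self, inputs):
        # Serialize all inputs into one byte string and CRC-32 it, so a
        # multi-input combination collapses into a single hash signature.
        data = b"".join(int(x).to_bytes(8, "little", signed=True) for x in inputs)
        return zlib.crc32(data)

    def lookup_or_compute(self, inputs, compute):
        sig = self._signature(inputs)
        idx = sig % self.size
        entry = self.table.get(idx)
        if entry is not None and entry[0] == sig:
            return entry[1]            # memoization hit: skip the block entirely
        result = compute(*inputs)      # miss: execute the original code block
        self.table[idx] = (sig, result)
        return result

# Example: memoize a toy three-input kernel.
memo = CRCMemoTable()
square_sum = lambda a, b, c: a * a + b * b + c * c
first = memo.lookup_or_compute((1, 2, 3), square_sum)   # miss: computes 14
second = memo.lookup_or_compute((1, 2, 3), square_sum)  # hit: served from table
```

Because the table is indexed by a hash of the inputs rather than by exact-match search, distinct input combinations can collide; accepting the occasional wrong-but-similar cached output on such collisions is what makes this memoization approximate.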
