Exploiting the Potential of Computation Reuse Through Approximate Computing

Approximate computing, which trades off computation quality (e.g., accuracy) against computational effort, is becoming a promising technique for improving the performance of many non-mission-critical, error-tolerant applications. The computations in such applications usually exhibit strong value locality, i.e., computations performed by a function or code region are very likely to produce “similar” results. Reusing such similar results can bypass redundant computations, as long as “exact” results are not mandatory. However, conventional computation reuse techniques are less effective in the approximate computing paradigm: the input values of two computations must be identical for one result to be reused for the other, so these techniques are “exact” in nature. We propose ACR, an approximate computation reuse scheme, to enable computation reuse for approximate computing. ACR relaxes the exact-matching requirement on inputs to an extent regulated by “similarity” quantification, thereby shifting the exact computation reuse paradigm to its approximate counterpart. Specifically, using statistical approaches, ACR first provides an input-significance-aware similarity quantification scheme to calculate the similarity between different computations. ACR also provides a regression-based branch prediction technique to resolve conditional branches inside a computation. Furthermore, on top of the proposed approximate computing framework, ACR presents a parallel implementation of the computing engine that carries out branch resolution and similar-computation search. Experimental results show that the ACR scheme effectively exploits the potential of computation reuse for approximate computing and achieves a 3.12× speedup on average for a set of approximate benchmarks.
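
To make the idea of relaxing exact input matching concrete, the following is a minimal Python sketch of a reuse table that returns a stored result when a new input vector is "similar enough" to a previously seen one. It is an illustration only and not the ACR design: the class name `ApproxReuseTable`, the weighted Euclidean distance, the per-input `weights`, the `threshold`, and the example `kernel` are all assumptions introduced here, and the table is scanned sequentially rather than searched in parallel.

```python
import math
from typing import Callable, List, Optional, Sequence, Tuple


class ApproxReuseTable:
    """Hypothetical memo table that reuses results for 'similar' inputs."""

    def __init__(self, weights: Sequence[float], threshold: float) -> None:
        # Assumed per-input significance: a larger weight means the input
        # has more influence on the output, so it must match more closely.
        self.weights = list(weights)
        # Assumed similarity bound: reuse only if distance <= threshold.
        self.threshold = threshold
        self.entries: List[Tuple[Tuple[float, ...], float]] = []
        self.hits = 0
        self.misses = 0

    def _distance(self, a: Sequence[float], b: Sequence[float]) -> float:
        # Significance-weighted Euclidean distance (an assumed metric).
        return math.sqrt(
            sum(w * (x - y) ** 2 for w, x, y in zip(self.weights, a, b))
        )

    def lookup(self, inputs: Sequence[float]) -> Optional[float]:
        # Sequential scan for the closest stored entry within the threshold.
        best: Optional[float] = None
        best_d = self.threshold
        for stored, result in self.entries:
            d = self._distance(inputs, stored)
            if d <= best_d:
                best, best_d = result, d
        return best

    def call(self, fn: Callable[..., float], *inputs: float) -> float:
        cached = self.lookup(inputs)
        if cached is not None:
            self.hits += 1          # similar computation found: skip fn
            return cached
        self.misses += 1            # no similar entry: compute and record
        result = fn(*inputs)
        self.entries.append((tuple(inputs), result))
        return result


def kernel(x: float, y: float) -> float:
    # Toy error-tolerant computation: x matters a lot, y barely at all.
    return math.sin(x) + 0.001 * y


table = ApproxReuseTable(weights=(1.0, 0.0001), threshold=0.05)
first = table.call(kernel, 1.00, 5.0)    # miss: computed and stored
second = table.call(kernel, 1.01, 9.0)   # hit: weighted distance is about 0.041
third = table.call(kernel, 1.50, 5.0)    # miss: x differs too much
```

In this sketch the second call reuses the first result even though `y` changed from 5.0 to 9.0, because `y` carries a small weight, while the third call is recomputed because the significant input `x` moved beyond the threshold. Choosing the weights and the threshold is exactly the quality-versus-effort tradeoff described above; the values here are arbitrary.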
