Quality configurable reduce-and-rank for energy efficient approximate computing

Approximate computing is an emerging design paradigm that exploits the intrinsic ability of applications to produce acceptable outputs even when their computations are executed approximately. In this work, we explore approximate computing for a key computation pattern, Reduce-and-Rank (RnR), which is prevalent in a wide range of workloads including video processing, recognition, search and data mining. An RnR kernel performs a reduction operation (e.g., distance computation, dot product, L1-norm) between an input vector and each of a set of reference vectors, and ranks the reduction outputs to select the top reference vectors for the current input. We propose two complementary approximation strategies for the RnR computation pattern. The first is interleaved reduction-and-ranking, wherein the vector reductions are decomposed into multiple partial reductions and interleaved with the rank computation. Leveraging this transformation, we propose the use of intermediate reduction results and ranks to identify future computations that are likely to have low impact on the output, and can hence be approximated. The second strategy, input similarity based approximation, exploits the spatial or temporal correlation of inputs (e.g., pixels of an image or frames of a video) to identify computations that are amenable to approximation. These strategies address a key challenge in approximate computing - identification of which computations to approximate - and may be used to drive any approximation mechanism such as computation skipping and precision scaling to realize performance or energy improvements. A second key challenge in approximate computing is that the extent to which computations can be approximated varies significantly from application to application, and across inputs for even a single application. Hence, quality configurability, or the ability to automatically modulate the degree of approximation at runtime is essential. To enable quality configurability in RnR kernels, we propose a kernel-level quality metric that correlates well to application-level quality, and identify key parameters that can be used to tune the proposed approximation strategies dynamically. We develop a runtime framework that modulates the identified parameters during execution of RnR kernels to minimize their energy while meeting a given target quality. To evaluate the proposed concepts, we designed quality-configurable hardware implementations of 6 RnR-based applications from the recognition, mining, search and video processing application domains in 45nm technology. Our experiments demonstrate 1.06X-2.18X reduction in energy consumption with virtually no loss in output quality (<;0.5%) at the application-level. The energy benefits further improve up to 2.38X and 2.5X when the quality constraints are relaxed to 2.5% and 5% respectively.

[1]  Scott A. Mahlke,et al.  SAGE: Self-tuning approximation for graphics engines , 2013, 2013 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[2]  Ilia Polian,et al.  Adaptive voltage over-scaling for resilient applications , 2011, 2011 Design, Automation & Test in Europe.

[3]  Lingamneni Avinash,et al.  Energy parsimonious circuit design through probabilistic pruning , 2011, 2011 Design, Automation & Test in Europe.

[4]  Sandeep K. Gupta,et al.  Approximate logic synthesis for error tolerant applications , 2010, 2010 Design, Automation & Test in Europe Conference & Exhibition (DATE 2010).

[5]  Sherief Reda,et al.  ABACUS: A technique for automated behavioral synthesis of approximate computing circuits , 2014, 2014 Design, Automation & Test in Europe Conference & Exhibition (DATE).

[6]  Kaushik Roy,et al.  SALSA: Systematic logic synthesis of approximate circuits , 2012, DAC Design Automation Conference 2012.

[7]  Arnab Raha,et al.  A Power Efficient Video Encoder Using Reconfigurable Approximate Arithmetic Units , 2014, 2014 27th International Conference on VLSI Design and 2014 13th International Conference on Embedded Systems.

[8]  Sandeep K. Gupta,et al.  A new circuit simplification method for error tolerant applications , 2011, 2011 Design, Automation & Test in Europe.

[9]  Melvin A. Breuer,et al.  Multi-media applications and imprecise computation , 2005, 8th Euromicro Conference on Digital System Design (DSD'05).

[10]  Kaushik Roy,et al.  Quality programmable vector processors for approximate computing , 2013, 2013 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[11]  Andrew B. Kahng,et al.  Accuracy-configurable adder for approximate arithmetic designs , 2012, DAC Design Automation Conference 2012.

[12]  Henry Hoffmann,et al.  Managing performance vs. accuracy trade-offs with loop perforation , 2011, ESEC/FSE '11.

[13]  Woongki Baek,et al.  Green: a framework for supporting energy-conscious programming using controlled approximation , 2010, PLDI '10.

[14]  Anand Raghunathan,et al.  Best-effort computing: Re-thinking parallel software and hardware , 2010, Design Automation Conference.

[15]  Puneet Gupta,et al.  Trading Accuracy for Power with an Underdesigned Multiplier Architecture , 2011, 2011 24th Internatioal Conference on VLSI Design.

[16]  Kaushik Roy,et al.  Dynamic effort scaling: Managing the quality-efficiency tradeoff , 2011, 2011 48th ACM/EDAC/IEEE Design Automation Conference (DAC).

[17]  Kaushik Roy,et al.  ASLAN: Synthesis of approximate sequential circuits , 2014, 2014 Design, Automation & Test in Europe Conference & Exhibition (DATE).

[18]  Krishna V. Palem,et al.  Optimizing energy to minimize errors in dataflow graphs using approximate adders , 2010, CASES '10.