Architecture-Aware Approximate Computing

Deliberate use of approximate computing has been an active research area recently. Observing that many application programs from different domains can live with less-than-perfect accuracy, existing techniques try to trade off program output accuracy with performance-energy savings. While these works provide point solutions, they leave three critical questions regarding approximate computing unanswered, especially in the context of dropping/skipping costly data accesses: (i) what is the maximum potential of skipping (i.e., not performing) data accesses under a given inaccuracy bound?; (ii) can we identify the data accesses to drop randomly, or is being architecture aware (i.e., identifying the costliest data accesses in a given architecture) critical?; and (iii) do two executions that skip the same number of data accesses always result in the same output quality (error)? This paper first provides answers to these questions using ten multithreaded workloads, and then, motivated by the negative answer to the third question, presents a program slicing-based approach that identifies the set of data accesses to drop such that (i) the resulting performance/energy benefits are maximized and (ii) the execution remains within the error (inaccuracy) bound specified by the user. Our slicing-based approach first uses backward slicing and then forward slicing to decide the set of data accesses to drop. Our experimental evaluations using ten multithreaded workloads show that, when averaged over all benchmark programs we have, 8.8% performance improvement and 13.7% energy saving are possible when we set the error bound to 2%, and the corresponding improvements jump to 15% and 25%, respectively, when the error bound is raised to 4%.

[1]  Martin C. Rinard,et al.  Unsynchronized Techniques for Approximate Parallel Computing , 2012 .

[2]  William Pugh,et al.  Iteration space slicing and its application to communication optimization , 1997, ICS '97.

[3]  Jung Ho Ahn,et al.  McPAT: An integrated power, area, and timing modeling framework for multicore and manycore architectures , 2009, 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[4]  Anoop Gupta,et al.  The SPLASH-2 programs: characterization and methodological considerations , 1995, ISCA.

[5]  Mario Badr,et al.  Load Value Approximation , 2014, 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture.

[6]  Sherief Reda,et al.  ABACUS: A technique for automated behavioral synthesis of approximate computing circuits , 2014, 2014 Design, Automation & Test in Europe Conference & Exhibition (DATE).

[7]  Frank Tip,et al.  A survey of program slicing techniques , 1994, J. Program. Lang..

[8]  Scott A. Mahlke,et al.  SAGE: Self-tuning approximation for graphics engines , 2013, 2013 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[9]  Martin C. Rinard,et al.  Efficiency Limits for Value-Deviation-Bounded Approximate Communication , 2015, IEEE Embedded Systems Letters.

[10]  Sharad Malik,et al.  Cache miss equations: a compiler framework for analyzing and tuning memory behavior , 1999, TOPL.

[11]  Asit K. Mishra,et al.  iACT: A Software-Hardware Framework for Understanding the Scope of Approximate Computing , 2014 .

[12]  Yen-Chen Liu,et al.  Knights Landing: Second-Generation Intel Xeon Phi Product , 2016, IEEE Micro.

[13]  Kaushik Roy,et al.  MACACO: Modeling and analysis of circuits for approximate computing , 2011, 2011 IEEE/ACM International Conference on Computer-Aided Design (ICCAD).

[14]  Joseph Robert Horgan,et al.  Dynamic program slicing , 1990, PLDI '90.

[15]  Mikko H. Lipasti,et al.  Value locality and load value prediction , 1996, ASPLOS VII.

[16]  Scott A. Mahlke,et al.  Paraprox: pattern-based approximation for data parallel applications , 2014, ASPLOS.

[17]  Luis Ceze,et al.  Architecture support for disciplined approximate programming , 2012, ASPLOS XVII.

[18]  Mahmut T. Kandemir,et al.  Improving bank-level parallelism for irregular applications , 2016, 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[19]  Donald Yeung,et al.  Exploiting Soft Computing for Increased Fault Tolerance , 2006 .

[20]  Mahmut T. Kandemir,et al.  Enhancing computation-to-core assignment with physical location information , 2018, PLDI.

[21]  Uwe Naumann,et al.  Towards automatic significance analysis for approximate computing , 2016, 2016 IEEE/ACM International Symposium on Code Generation and Optimization (CGO).

[22]  Kaushik Roy,et al.  Design of voltage-scalable meta-functions for approximate computing , 2011, 2011 Design, Automation & Test in Europe.

[23]  Ismail Akturk,et al.  On Quantification of Accuracy Loss in Approximate Computing , 2015 .

[24]  Qiang Xu,et al.  ApproxIt: An approximate computing framework for iterative methods , 2014, 2014 51st ACM/EDAC/IEEE Design Automation Conference (DAC).

[25]  Mahmut T. Kandemir,et al.  Scheduling techniques for GPU architectures with processing-in-memory capabilities , 2016, 2016 International Conference on Parallel Architecture and Compilation Techniques (PACT).

[26]  Sally A. McKee,et al.  Hitting the memory wall: implications of the obvious , 1995, CARN.

[27]  Yong Zhang,et al.  An energy efficient approximate adder with carry skip for error resilient neuromorphic VLSI systems , 2013, 2013 IEEE/ACM International Conference on Computer-Aided Design (ICCAD).

[28]  Jie Han,et al.  Approximate computing: An emerging paradigm for energy-efficient design , 2013, 2013 18th IEEE European Test Symposium (ETS).

[29]  Vijayalakshmi Srinivasan,et al.  Programming with relaxed synchronization , 2012, RACES '12.

[30]  James Demmel,et al.  Precimonious: Tuning assistant for floating-point precision , 2013, 2013 SC - International Conference for High Performance Computing, Networking, Storage and Analysis (SC).

[31]  Martin C. Rinard,et al.  Chisel: reliability- and accuracy-aware optimization of approximate computational kernels , 2014, OOPSLA.

[32]  Martin C. Rinard,et al.  Automatically identifying critical input regions and code in applications , 2010, ISSTA '10.

[33]  Mahmut T. Kandemir,et al.  Controlled Kernel Launch for Dynamic Parallelism in GPUs , 2017, 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA).

[34]  Kai Li,et al.  The PARSEC benchmark suite: Characterization and architectural implications , 2008, 2008 International Conference on Parallel Architectures and Compilation Techniques (PACT).

[35]  Luis Ceze,et al.  Neural Acceleration for General-Purpose Approximate Programs , 2014, IEEE Micro.

[36]  Mahmut T. Kandemir,et al.  Memory Row Reuse Distance and its Role in Optimizing Application Performance , 2015, SIGMETRICS 2015.

[37]  Dan Grossman,et al.  EnerJ: approximate data types for safe and general low-power computation , 2011, PLDI '11.

[38]  David W. Binkley,et al.  Program slicing , 2008, 2008 Frontiers of Software Maintenance.

[39]  Mahmut T. Kandemir,et al.  Co-optimizing memory-level parallelism and cache-level parallelism , 2019, PLDI.

[40]  Luis Ceze,et al.  General-purpose code acceleration with limited-precision analog computation , 2014, 2014 ACM/IEEE 41st International Symposium on Computer Architecture (ISCA).

[41]  Mahmut T. Kandemir,et al.  Data Movement Aware Computation Partitioning , 2017, 2017 50th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[42]  Scott A. Mahlke,et al.  Scaling Performance via Self-Tuning Approximation for Graphics Engines , 2014, TOCS.

[43]  Kaushik Roy,et al.  Quality programmable vector processors for approximate computing , 2013, 2013 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[44]  Keshav Pingali,et al.  Proactive Control of Approximate Programs , 2016, ASPLOS.

[45]  Mahmut T. Kandemir,et al.  Opportunistic Computing in GPU Architectures , 2019, 2019 ACM/IEEE 46th Annual International Symposium on Computer Architecture (ISCA).

[46]  Kaushik Roy,et al.  Approximate computing: An integrated hardware approach , 2013, 2013 Asilomar Conference on Signals, Systems and Computers.

[47]  Naresh R. Shanbhag,et al.  Energy-efficient signal processing via algorithmic noise-tolerance , 1999, Proceedings. 1999 International Symposium on Low Power Electronics and Design (Cat. No.99TH8477).