[Figure labels: Existing Limitations; ARGA Solutions; Warp Lockstep; Warp Bottleneck; Multi Lookup Table in Parallel; Dynamic Table with LRU; Warp Passthrough; Speedup; Warp Balance; High Hit Rate]

Many data-driven applications, including computer vision, speech recognition, and medical diagnostics, tolerate errors in computation. These applications are often accelerated on GPUs, but high computational costs limit performance and increase energy usage. In this paper, we present ARGA, an approximate computing technique that accelerates GPGPU applications. ARGA provides GPGPU cores with an approximate lookup table so that instructions with identical or similar input values need not be recomputed. We propose multi-table parallel lookup, which checks incoming instructions in parallel and enables computational reuse to significantly speed up GPGPU computation. The inputs of each operation are searched for in a lookup table. Matches that yield exact or low-error results are removed from the floating-point pipeline, and the stored result is used directly as output. Matches that would produce highly inaccurate results are computed on exact hardware to minimize application error. We simulate our design by placing ARGA within each core of an NVIDIA Kepler-architecture Titan and an AMD Southern Islands 7970. We show that our design improves throughput by up to 2.7× and improves energy-delay product (EDP) by 5.3× for 6 GPGPU applications while maintaining less than 5% output error. We also show that ARGA accelerates inference of a LeNet neural network by 2.1× and improves EDP by 3.7× without significantly impacting classification accuracy.

CCS CONCEPTS • Computer systems organization → Multicore architectures; • Computing methodologies → Machine learning approaches.
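The lookup flow described in the abstract (search the operands in a table, reuse the stored result on a close match, otherwise fall back to exact computation) can be illustrated with a small software model. This is only a sketch under stated assumptions, not ARGA's hardware design: the class name `ApproxLookupTable`, the `capacity` and `tolerance` parameters, and the relative-distance matching rule are all hypothetical, since the abstract does not specify how similarity is measured. The model uses a single table with least-recently-used replacement and does not capture the multi-table parallel lookup.

```python
import operator
from collections import OrderedDict


class ApproxLookupTable:
    """Toy model of an approximate reuse table (illustrative, not ARGA itself)."""

    def __init__(self, capacity=4, tolerance=0.01):
        self.capacity = capacity    # assumed number of table rows
        self.tolerance = tolerance  # assumed maximum relative operand mismatch
        self.rows = OrderedDict()   # (a, b) -> result, least recently used first
        self.hits = 0
        self.misses = 0

    def _close(self, x, y):
        # Relative distance test; the small floor avoids a zero-width window at 0.
        return abs(x - y) <= self.tolerance * max(abs(x), abs(y), 1e-12)

    def execute(self, op, a, b):
        # Search for a stored operand pair that is close to the incoming one.
        match = next(
            (k for k in self.rows
             if self._close(a, k[0]) and self._close(b, k[1])),
            None,
        )
        if match is not None:
            # Hit: skip the computation and reuse the stored result.
            self.rows.move_to_end(match)  # mark as most recently used
            self.hits += 1
            return self.rows[match]
        # Miss: compute exactly, then store the result for later reuse.
        self.misses += 1
        result = op(a, b)
        self.rows[(a, b)] = result
        if len(self.rows) > self.capacity:
            self.rows.popitem(last=False)  # evict the least recently used row
        return result


table = ApproxLookupTable(capacity=2, tolerance=0.01)
r1 = table.execute(operator.mul, 2.0, 3.0)      # miss: computed exactly
r2 = table.execute(operator.mul, 2.001, 3.001)  # hit: reuses the stored 6.0
r3 = table.execute(operator.mul, 2.5, 3.0)      # miss: operands too far apart
print(r1, r2, r3, table.hits, table.misses)
```

In this model the `tolerance` value sets the trade-off between hit rate and output error: a larger value reuses more results but returns less accurate ones, while a value of zero reduces the table to exact-match reuse.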
