Neural acceleration for GPU throughput processors

Graphics Processing Units (GPUs) can accelerate diverse classes of applications, such as recognition, gaming, data analytics, weather prediction, and multimedia. Many of these applications are amenable to approximate execution. This application characteristic provides an opportunity to improve GPU performance and efficiency. Among approximation techniques, neural accelerators have been shown to provide significant performance and efficiency gains when augmenting CPU processors. However, the integration of neural accelerators within a GPU processor has remained unexplored. GPUs are, in a sense, many-core accelerators that exploit large degrees of data-level parallelism in the applications through the SIMT execution model. This paper aims to harmoniously bring neural and GPU accelerators together without hindering SIMT execution or adding excessive hardware overhead. We introduce a low overhead neurally accelerated architecture for GPUs, called NGPU, that enables scalable integration of neural accelerators for large number of GPU cores. This work also devises a mechanism that controls the tradeoff between the quality of results and the benefits from neural acceleration. Compared to the baseline GPU architecture, cycle-accurate simulation results for NGPU show a 2.4× average speedup and a 2.8× average energy reduction within 10% quality loss margin across a diverse set of benchmarks. The proposed quality control mechanism retains a 1.9 ×s average speedup and a 2.1 × energy reduction while reducing the degradation in the quality of results to 2.5%. These benefits are achieved by less than 1% area overhead.

[1]  Henry Hoffmann,et al.  Patterns and statistical analysis for understanding reduced resource computing , 2010, OOPSLA.

[2]  Quinn Jacobson,et al.  ERSA: error resilient system architecture for probabilistic applications , 2010, DATE 2010.

[3]  Donald Yeung,et al.  Application-Level Correctness and its Impact on Fault Tolerance , 2007, 2007 IEEE 13th International Symposium on High Performance Computer Architecture.

[4]  Karthik Pattabiraman,et al.  Flicker: Saving Refresh-Power in Mobile Devices through Critical Data Partitioning , 2009 .

[5]  Krishna V. Palem,et al.  Ultra-Efficient (Embedded) SOC Architectures based on Probabilistic CMOS (PCMOS) Technology , 2006, Proceedings of the Design Automation & Test in Europe Conference.

[6]  Mahmut T. Kandemir,et al.  A case for Core-Assisted Bottleneck Acceleration in GPUs: Enabling flexible data compression with assist warps , 2015, 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA).

[7]  Andreas Gerstlauer,et al.  Approximate logic synthesis under general error magnitude and frequency constraints , 2013, 2013 IEEE/ACM International Conference on Computer-Aided Design (ICCAD).

[8]  Scott A. Mahlke,et al.  Paraprox: pattern-based approximation for data parallel applications , 2014, ASPLOS.

[9]  Glenn Reinman,et al.  BRAINIAC: Bringing reliable accuracy into neurally-implemented approximate computing , 2015, 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA).

[10]  John Sartori,et al.  Branch and Data Herding: Reducing Control and Memory Divergence for Error-Tolerant GPU Applications , 2012, IEEE Transactions on Multimedia.

[11]  Jacob Nelson,et al.  SNNAP: Approximate computing on programmable SoCs via neural acceleration , 2015, 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA).

[12]  Zheng Li,et al.  Continuous real-world inputs can open up alternative accelerator designs , 2013, ISCA.

[13]  Song Liu,et al.  Flikker: saving DRAM refresh-power through critical data partitioning , 2011, ASPLOS XVI.

[14]  M. Valero,et al.  Fuzzy memoization for floating-point multimedia applications , 2005, IEEE Transactions on Computers.

[15]  Luis Ceze,et al.  General-purpose code acceleration with limited-precision analog computation , 2014, 2014 ACM/IEEE 41st International Symposium on Computer Architecture (ISCA).

[16]  Douglas L. Jones,et al.  Scalable stochastic processors , 2010, 2010 Design, Automation & Test in Europe Conference & Exhibition (DATE 2010).

[17]  Mohammad Zubair,et al.  High Performance Implementation of an Econometrics and Financial Application on GPUs , 2012, 2012 SC Companion: High Performance Computing, Networking Storage and Analysis.

[18]  Olivier Temam,et al.  Leveraging the error resilience of machine-learning applications for designing highly energy efficient accelerators , 2014, 2014 19th Asia and South Pacific Design Automation Conference (ASP-DAC).

[19]  Karthikeyan Sankaralingam,et al.  Dark Silicon and the End of Multicore Scaling , 2012, IEEE Micro.

[20]  Sherief Reda,et al.  ABACUS: A technique for automated behavioral synthesis of approximate computing circuits , 2014, 2014 Design, Automation & Test in Europe Conference & Exhibition (DATE).

[21]  Glenn Reinman,et al.  Accelerating divergent applications on SIMD architectures using neural networks , 2014, ICCD.

[22]  Donald Yeung,et al.  Exploiting Application-Level Correctness for Low-Cost Fault Tolerance , 2008, J. Instr. Level Parallelism.

[23]  Martin C. Rinard,et al.  Verifying quantitative reliability for programs that execute on unreliable hardware , 2013, OOPSLA.

[24]  Steven Swanson,et al.  Conservation cores: reducing the energy of mature computations , 2010, ASPLOS XV.

[25]  Luis Ceze,et al.  Architecture support for disciplined approximate programming , 2012, ASPLOS XVII.

[26]  William J. Dally,et al.  Memory access scheduling , 2000, Proceedings of 27th International Symposium on Computer Architecture (IEEE Cat. No.RS00201).

[27]  Henry Wong,et al.  Analyzing CUDA workloads using a detailed GPU simulator , 2009, 2009 IEEE International Symposium on Performance Analysis of Systems and Software.

[28]  Kevin Skadron,et al.  Rodinia: A benchmark suite for heterogeneous computing , 2009, 2009 IEEE International Symposium on Workload Characterization (IISWC).

[29]  Jung Ho Ahn,et al.  McPAT: An integrated power, area, and timing modeling framework for multicore and manycore architectures , 2009, 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[30]  Joel C. Huegel,et al.  Inverse Kinematics Solution for Robotic Manipulators Using a CUDA-Based Parallel Genetic Algorithm , 2011, MICAI.

[31]  Mohammad Zubair,et al.  A High Performance Implementation of Likelihood Estimators on GPUs , 2013 .

[32]  Babak Falsafi,et al.  Toward Dark Silicon in Servers , 2011, IEEE Micro.

[33]  Scott A. Mahlke,et al.  SAGE: Self-tuning approximation for graphics engines , 2013, 2013 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[34]  Huawei Li,et al.  A Fault Criticality Evaluation Framework of Digital Systems for Error Tolerant Video Applications , 2011, 2011 Asian Test Symposium.

[35]  John Banning,et al.  : An Efficient , 2022 .

[36]  Vicky Wong,et al.  Soft Error Resilience of Probabilistic Inference Applications , 2006 .

[37]  Kaushik Roy,et al.  ASLAN: Synthesis of approximate sequential circuits , 2014, 2014 Design, Automation & Test in Europe Conference & Exhibition (DATE).

[38]  Lingamneni Avinash,et al.  Synthesizing Parsimonious Inexact Circuits through Probabilistic Design Techniques , 2013, TECS.

[39]  Kia Bazargan,et al.  Axilog: Language support for approximate hardware design , 2015, 2015 Design, Automation & Test in Europe Conference & Exhibition (DATE).

[40]  Richard M. Karp,et al.  Algorithmic methodologies for ultra-efficient inexact architectures for sustaining technology scaling , 2012, CF '12.

[41]  Norman P. Jouppi,et al.  Optimizing NUCA Organizations and Wiring Alternatives for Large Caches with CACTI 6.0 , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).

[42]  Luis Ceze,et al.  Neural Acceleration for General-Purpose Approximate Programs , 2014, IEEE Micro.

[43]  Jose-Maria Arnau,et al.  Eliminating redundant fragment shader executions on a mobile GPU via hardware memoization , 2014, 2014 ACM/IEEE 41st International Symposium on Computer Architecture (ISCA).

[44]  Dan Grossman,et al.  EnerJ: approximate data types for safe and general low-power computation , 2011, PLDI '11.

[45]  K. Sankaralingam,et al.  Exploring the Synergy of Emerging Workloads and Silicon Reliability Trends , 2009 .

[46]  Nam Sung Kim,et al.  GPUWattch: enabling energy optimizations in GPGPUs , 2013, ISCA.

[47]  Jacob Nelson,et al.  Approximate storage in solid-state memories , 2013, 2013 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[48]  Kunle Olukotun,et al.  EMEURO: A framework for generating multi-purpose accelerators via deep learning , 2015, 2015 IEEE/ACM International Symposium on Code Generation and Optimization (CGO).

[49]  Woongki Baek,et al.  Green: a framework for supporting energy-conscious programming using controlled approximation , 2010, PLDI '10.

[50]  Naresh R. Shanbhag,et al.  Energy-efficient signal processing via algorithmic noise-tolerance , 1999, Proceedings. 1999 International Symposium on Low Power Electronics and Design (Cat. No.99TH8477).

[51]  Alan Edelman,et al.  PetaBricks: a language and compiler for algorithmic choice , 2009, PLDI '09.

[52]  Gu-Yeon Wei,et al.  Process Variation Tolerant 3T1D-Based Cache Architectures , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).

[53]  Karthikeyan Sankaralingam,et al.  Relax: an architectural framework for software recovery of hardware faults , 2010, ISCA.

[54]  Kaushik Roy,et al.  SALSA: Systematic logic synthesis of approximate circuits , 2012, DAC Design Automation Conference 2012.

[55]  Geoffrey E. Hinton,et al.  Learning internal representations by error propagation , 1986 .

[56]  Henry Hoffmann,et al.  Quality of service profiling , 2010, 2010 ACM/IEEE 32nd International Conference on Software Engineering.

[57]  Henry Hoffmann,et al.  Managing performance vs. accuracy trade-offs with loop perforation , 2011, ESEC/FSE '11.

[58]  Kaushik Roy,et al.  Quality programmable vector processors for approximate computing , 2013, 2013 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[59]  Xin Zhang,et al.  FlexJava: language support for safe and modular approximate programming , 2015, ESEC/SIGSOFT FSE.

[60]  Mike O'Connor,et al.  Cache-Conscious Wavefront Scheduling , 2012, 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture.