Energy-efficient GPGPU architectures via collaborative compilation and memristive memory-based computing

Thousands of deep and wide pipelines working concurrently make GPGPUs power-hungry components. Energy-efficiency techniques that employ voltage overscaling increase timing sensitivity to variations, which in turn aggravates the energy use problem. This paper proposes a method to increase spatiotemporal reuse of computational effort through a combination of compilation and micro-architectural design. An associative memristive memory (AMM) module is integrated with the floating point units (FPUs). Together, these enable fine-grained partitioning of values and identify high-frequency sets of values for the FPUs by searching the space of possible inputs, with the help of application-specific profile feedback. For every kernel execution, the compiler pre-stores these high-frequency sets of values in the AMM modules, which represent partial functionality of the associated FPU and are evaluated concurrently over two clock cycles. Our simulation results show high hit rates with 32-entry AMM modules, which enable a 36% reduction in average energy use by the kernel codes. Compared to voltage overscaling, this technique enhances robustness against timing errors with a 39% average energy saving.
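The profile-then-pre-store flow described above can be illustrated with a small software model. This is a minimal sketch under stated assumptions, not the paper's implementation: the names `build_amm` and `execute` are hypothetical, a Python dictionary stands in for the associative memory, and matching is done on exact operand tuples rather than on the fine-grained value partitions the abstract mentions. It models only the lookup behaviour and hit rate, not timing or energy.

```python
from collections import Counter
import operator

AMM_ENTRIES = 32  # module size reported in the abstract


def build_amm(profile_trace, fpu_op, entries=AMM_ENTRIES):
    """Model the compiler step: pick the most frequent operand tuples
    seen during profiling and precompute their results.

    profile_trace: iterable of operand tuples, e.g. (a, b) for a multiply.
    fpu_op: the function the FPU implements (assumed pure).
    """
    counts = Counter(profile_trace)
    return {
        operands: fpu_op(*operands)
        for operands, _ in counts.most_common(entries)
    }


def execute(trace, amm, fpu_op):
    """Model kernel execution: reuse a stored result on an AMM hit,
    otherwise fall back to computing with the FPU.

    Returns the list of results and the observed hit rate.
    """
    hits = 0
    results = []
    for operands in trace:
        if operands in amm:
            hits += 1
            results.append(amm[operands])       # hit: result reused
        else:
            results.append(fpu_op(*operands))   # miss: FPU computes
    hit_rate = hits / len(trace) if trace else 0.0
    return results, hit_rate


if __name__ == "__main__":
    # Hypothetical profiling trace for a floating-point multiply.
    profile = [(1.0, 2.0)] * 5 + [(3.0, 4.0)] * 3 + [(5.0, 6.0)]
    amm = build_amm(profile, operator.mul, entries=2)
    run = [(1.0, 2.0), (5.0, 6.0), (3.0, 4.0), (1.0, 2.0)]
    print(execute(run, amm, operator.mul))
```

In this toy model the hit rate depends entirely on how well the profiling trace predicts the operands seen at run time, which is why the abstract stresses application-specific profile feedback.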
