Approximate Memory Compression

Memory subsystems are a major energy bottleneck in computing platforms due to frequent transfers between processors and off-chip memory. We propose approximate memory compression, a technique that leverages the intrinsic resilience of emerging workloads such as machine learning and data analytics to reduce off-chip memory traffic, thereby improving energy and performance. We realize approximate memory compression by enhancing the memory controller to be aware of approximate memory regions—regions in memory that contain approximation-resilient data—and to transparently compress (decompress) the data written to (read from) these regions. To provide control over approximations, each approximate memory region is associated with an error constraint such as the maximum error that may be introduced in each data element. The quality-aware memory controller subjects memory transactions to a compression scheme that introduces approximations, thereby reducing memory traffic, while adhering to the specified error constraint for each approximate memory region. A software interface is provided to allow programmers to identify data structures (DSs) that are resilient to approximations. A runtime quality control framework automatically determines the error constraints for the identified DSs such that a given target application-level quality is maintained. We evaluate our proposal by applying it to three different main memory technologies in the context of a general-purpose computing system—DDR3 DRAM, LPDDR3 DRAM, and spin-transfer torque magnetic RAM (STT-MRAM). To demonstrate the feasibility of the proposed concepts, we also implement a hardware prototype using the Intel UniPHY-DDR3 memory controller and Nios-II processor, a Hynix DDR3 DRAM module, and a Stratix-IV field-programmable gate array (FPGA) development board. Across a wide range of machine learning benchmarks, approximate memory compression obtains significant benefits in main memory energy ( $1.18\times $ for DDR3 DRAM, $1.52\times $ for LPDDR3 DRAM, and $2.0\times $ for STT-MRAM) and a simultaneous improvement in execution time (5.2% for DDR3 DRAM, 5.4% for LPDDR3 DRAM, and 9.3% for STT-MRAM) with nearly identical application output quality.

[1]  Pradeep Dubey,et al.  Convergence of Recognition, Mining, and Synthesis Workloads and Its Implications , 2008, Proceedings of the IEEE.

[2]  Gabriel H. Loh,et al.  3D-Stacked Memory Architectures for Multi-core Processors , 2008, 2008 International Symposium on Computer Architecture.

[3]  Kaushik Roy,et al.  STAxCache: An approximate, energy efficient STT-MRAM cache , 2017, Design, Automation & Test in Europe Conference & Exhibition (DATE), 2017.

[4]  Martin Burtscher,et al.  Bridging the processor-memory performance gap with 3D IC technology , 2005, IEEE Design & Test of Computers.

[5]  J. Thomas Pawlowski,et al.  Hybrid memory cube (HMC) , 2011, 2011 IEEE Hot Chips 23 Symposium (HCS).

[6]  Natalie D. Enright Jerger,et al.  The Bunker Cache for spatio-value approximation , 2016, 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[7]  Luiz André Barroso,et al.  The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines , 2009, The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines.

[8]  Mario Badr,et al.  Load Value Approximation , 2014, 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture.

[9]  André Seznec,et al.  Decoupled zero-compressed memory , 2011, HiPEAC.

[10]  Kaushik Roy,et al.  Approximate storage for energy efficient spintronic memories , 2015, 2015 52nd ACM/EDAC/IEEE Design Automation Conference (DAC).

[11]  Wayne H. Wolf,et al.  SAMC: a code compression algorithm for embedded processors , 1999, IEEE Trans. Comput. Aided Des. Integr. Circuits Syst..

[12]  Kaushik Roy,et al.  Analysis and characterization of inherent application resilience for approximate computing , 2013, 2013 50th ACM/EDAC/IEEE Design Automation Conference (DAC).

[13]  Norbert Wehn,et al.  Efficient reliability management in SoCs - an approximate DRAM perspective , 2016, 2016 21st Asia and South Pacific Design Automation Conference (ASP-DAC).

[14]  Nikil D. Dutt,et al.  QuARK: Quality-configurable approximate STT-MRAM cache by fine-grained tuning of reliability-energy knobs , 2017, 2017 IEEE/ACM International Symposium on Low Power Electronics and Design (ISLPED).

[15]  Jae W. Lee,et al.  eDRAM-based Tiered-Reliability Memory with applications to low-power frame buffers , 2014, 2014 IEEE/ACM International Symposium on Low Power Electronics and Design (ISLPED).

[16]  Qiang Xu,et al.  Approximate Computing: A Survey , 2016, IEEE Design & Test.

[17]  M. Ekman,et al.  A robust main-memory compression scheme , 2005, 32nd International Symposium on Computer Architecture (ISCA'05).

[18]  Scott A. Mahlke,et al.  Concise loads and stores: The case for an asymmetric compute-memory architecture for approximation , 2016, 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[19]  Anand Raghunathan,et al.  Computing in Memory With Spin-Transfer Torque Magnetic RAM , 2017, IEEE Transactions on Very Large Scale Integration (VLSI) Systems.

[20]  Nikil D. Dutt,et al.  Exploiting Partially-Forgetful Memories for Approximate Computing , 2015, IEEE Embedded Systems Letters.

[21]  Jörg Henkel,et al.  CoCo: a hardware/software platform for rapid prototyping of code compression technique , 2003, Proceedings 2003. Design Automation Conference (IEEE Cat. No.03CH37451).

[22]  Arnab Raha,et al.  Quality Configurable Approximate DRAM , 2017, IEEE Transactions on Computers.

[23]  Song Liu,et al.  Flikker: saving DRAM refresh-power through critical data partitioning , 2011, ASPLOS XVI.

[24]  Somayeh Sardashti,et al.  The gem5 simulator , 2011, CARN.

[25]  Anand Raghunathan,et al.  Approximate memory compression for energy-efficiency , 2017, 2017 IEEE/ACM International Symposium on Low Power Electronics and Design (ISLPED).

[26]  Kaushik Roy,et al.  A Priority-Based 6T/8T Hybrid SRAM Architecture for Aggressive Voltage Scaling in Video Applications , 2011, IEEE Transactions on Circuits and Systems for Video Technology.

[27]  Matthew Poremba,et al.  NVMain: An Architectural-Level Main Memory Simulator for Emerging Non-volatile Memories , 2012, 2012 IEEE Computer Society Annual Symposium on VLSI.

[28]  Luca Benini,et al.  Hardware-assisted data compression for energy minimization in systems with embedded processors , 2002, Proceedings 2002 Design, Automation and Test in Europe Conference and Exhibition.

[29]  Michael E. Wazlowski,et al.  Pinnacle: IBM MXT in a Memory Controller Chip , 2001, IEEE Micro.

[30]  Natalie D. Enright Jerger,et al.  Doppelgänger: A cache for approximate computing , 2015, 2015 48th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[31]  Mahmut T. Kandemir,et al.  Evaluating STT-RAM as an energy-efficient main memory alternative , 2013, 2013 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS).

[32]  Onur Mutlu,et al.  Linearly compressed pages: A low-complexity, low-latency main memory compression framework , 2013, 2013 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[33]  Kaushik Roy,et al.  Approximate computing and the quest for computing efficiency , 2015, 2015 52nd ACM/EDAC/IEEE Design Automation Conference (DAC).

[34]  J. Lucas,et al.  Sparkk : Quality-Scalable Approximate Storage in DRAM , 2014 .

[35]  Jacob Nelson,et al.  Approximate storage in solid-state memories , 2013, 2013 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).