CRAM: Efficient Hardware-Based Memory Compression for Bandwidth Enhancement

This paper investigates hardware-based memory compression designs to increase the memory bandwidth. When lines are compressible, the hardware can store multiple lines in a single memory location, and retrieve all these lines in a single access, thereby increasing the effective memory bandwidth. However, relocating and packing multiple lines together depending on the compressibility causes a line to have multiple possible locations. Therefore, memory compression designs typically require metadata to specify the compressibility of the line. Unfortunately, even in the presence of dedicated metadata caches, maintaining and accessing this metadata incurs significant bandwidth overheads and can degrade performance by as much as 40%. Ideally, we want to implement memory compression while eliminating the bandwidth overheads of metadata accesses. This paper proposes CRAM, a bandwidth-efficient design for memory compression that is entirely hardware based and does not require any OS support or changes to the memory modules or interfaces. CRAM uses a novel implicit-metadata mechanism, whereby the compressibility of the line can be determined by scanning the line for a special marker word, eliminating the overheads of metadata access. CRAM is equipped with a low-cost Line Location Predictor (LLP) that can determine the location of the line with 98% accuracy. Furthermore, we also develop a scheme that can dynamically enable or disable compression based on the bandwidth cost of storing compressed lines and the bandwidth benefits of obtaining compressed lines, ensuring no degradation for workloads that do not benefit from compression. Our evaluations, over a diverse set of 27 workloads, show that CRAM provides a speedup of up to 73% (average 6%) without causing slowdown for any of the workloads, and consuming a storage overhead of less than 300 bytes at the memory controller.

[1]  Per Stenström,et al.  SC2: A statistical compression cache scheme , 2014, 2014 ACM/IEEE 41st International Symposium on Computer Architecture (ISCA).

[2]  Doe Hyun Yoon,et al.  Adaptive granularity memory systems: A tradeoff between storage efficiency and throughput , 2011, 2011 38th Annual International Symposium on Computer Architecture (ISCA).

[3]  Timothy A. Davis,et al.  The university of Florida sparse matrix collection , 2011, TOMS.

[4]  Anant Agarwal,et al.  Column-associative caches: a technique for reducing the miss rate of direct-mapped caches , 1993, ISCA '93.

[5]  Per Stenström,et al.  HyComp: A hybrid cache compression method for selection of data-type-specific compression methods , 2015, 2015 48th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[6]  Jen-Shiun Chiang,et al.  Low-power way-predicting cache using valid-bit pre-decision for parallel architectures , 2005, 19th International Conference on Advanced Information Networking and Applications (AINA'05) Volume 1 (AINA papers).

[7]  Xi Chen,et al.  C-Pack: A High-Performance Microprocessor Cache Compression Algorithm , 2010, IEEE Transactions on Very Large Scale Integration (VLSI) Systems.

[8]  David A. Patterson,et al.  The GAP Benchmark Suite , 2015, ArXiv.

[9]  M. Ekman,et al.  A robust main-memory compression scheme , 2005, 32nd International Symposium on Computer Architecture (ISCA'05).

[10]  Don Coppersmith,et al.  The Data Encryption Standard (DES) and its strength against attacks , 1994, IBM J. Res. Dev..

[11]  Xiaowei Shen,et al.  Performance of hardware compressed main memory , 2001, Proceedings HPCA Seventh International Symposium on High-Performance Computer Architecture.

[12]  Jun Yang,et al.  Frequent Value Locality and Value-Centric Data Cache Design , 2000, ASPLOS.

[13]  Zhao Zhang,et al.  Mini-rank: Adaptive DRAM architecture for improving memory power efficiency , 2008, 2008 41st IEEE/ACM International Symposium on Microarchitecture.

[14]  David Wentzlaff,et al.  MORC: A manycore-oriented compressed cache , 2015, 2015 48th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[15]  Rajeev Balasubramonian,et al.  MemZip: Exploring unconventional benefits from memory compression , 2014, 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA).

[16]  Seth H. Pugsley,et al.  USIMM : the Utah SImulated Memory Module , 2012 .

[17]  Onur Mutlu,et al.  Linearly compressed pages: A low-complexity, low-latency main memory compression framework , 2013, 2013 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[18]  John L. Henning SPEC CPU2006 benchmark descriptions , 2006, CARN.

[19]  Alaa R. Alameldeen,et al.  Base-Victim Compression: An Opportunistic Cache Compression Architecture , 2016, 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA).

[20]  André Seznec,et al.  Dictionary sharing: An efficient cache compression scheme for compressed caches , 2016, 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[21]  Onur Mutlu,et al.  Base-delta-immediate compression: Practical data compression for on-chip caches , 2012, 2012 21st International Conference on Parallel Architectures and Compilation Techniques (PACT).

[22]  David H. Albonesi,et al.  Selective cache ways: on-demand cache resource allocation , 1999, MICRO-32. Proceedings of the 32nd Annual ACM/IEEE International Symposium on Microarchitecture.

[23]  David A. Wood,et al.  Frequent Pattern Compression: A Significance-Based Compression Scheme for L2 Caches , 2004 .

[24]  Moinuddin K. Qureshi,et al.  DICE: Compressing DRAM caches for bandwidth and capacity , 2017, 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA).

[25]  Dirk Grunwald,et al.  Predictive sequential associative cache , 1996, Proceedings. Second International Symposium on High-Performance Computer Architecture.

[26]  Mattan Erez,et al.  Bit-Plane Compression: Transforming Data for Better Compression in Many-Core Architectures , 2016, ISCA.

[27]  Nam Sung Kim,et al.  Lossless and lossy memory I/O link compression for improving performance of GPGPU workloads , 2012, 2012 21st International Conference on Parallel Architectures and Compilation Techniques (PACT).

[28]  Jun Yang,et al.  Frequent value compression in data caches , 2000, Proceedings 33rd Annual IEEE/ACM International Symposium on Microarchitecture. MICRO-33 2000.

[29]  Mark Horowitz,et al.  Cache performance of operating system and multiprogramming workloads , 1988, TOCS.

[30]  Paolo Faraboschi,et al.  Enabling technologies for memory compression: Metadata, mapping, and prediction , 2016, 2016 IEEE 34th International Conference on Computer Design (ICCD).

[31]  Gabriel H. Loh,et al.  Thread-aware dynamic shared cache compression in multi-core processors , 2011, 2011 IEEE 29th International Conference on Computer Design (ICCD).

[32]  André Seznec,et al.  Zero-content augmented caches , 2009, ICS '09.

[33]  Alberto Ros,et al.  PS-Cache: an energy-efficient cache design for chip multiprocessors , 2013, The Journal of Supercomputing.

[34]  Somayeh Sardashti,et al.  Decoupled compressed cache: Exploiting spatial locality for energy-optimized compressed caching , 2013, 2013 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[35]  Jaehyuk Huh,et al.  Transparent Dual Memory Compression Architecture , 2017, 2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT).

[36]  Rajiv Kapoor,et al.  Pinpointing Representative Portions of Large Intel® Itanium® Programs with Dynamic Instrumentation , 2004, 37th International Symposium on Microarchitecture (MICRO-37'04).

[37]  David A. Wood,et al.  Adaptive cache compression for high-performance processors , 2004, Proceedings. 31st Annual International Symposium on Computer Architecture, 2004..

[38]  Daniel A. Jiménez,et al.  Reducing network-on-chip energy consumption through spatial locality speculation , 2011, Proceedings of the Fifth ACM/IEEE International Symposium.

[39]  Kaushik Roy,et al.  Reducing set-associative cache energy via way-prediction and selective direct-mapping , 2001, Proceedings. 34th ACM/IEEE International Symposium on Microarchitecture. MICRO-34.

[40]  Samira Manabi Khan,et al.  Last-level cache deduplication , 2014, ICS '14.

[41]  Christoforos E. Kozyrakis,et al.  Future scaling of processor-memory interfaces , 2009, Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis.

[42]  Somayeh Sardashti,et al.  Skewed Compressed Caches , 2014, 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture.

[43]  Mikko H. Lipasti,et al.  COP: To compress and protect main memory , 2015, 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA).