Improving Performance in Sub-Block Caches with Optimized Replacement Policies

Recent advances in computer processor design have led to the introduction of sub-blocking to cache architectures. Sub-block caches reduce the tag area and power overhead in caches without reducing the effective cache size by using fewer tags to index the full data RAM array. In spite of achieving reduced area and power overhead, sub-block caches suffer performance degradation due to cache trashing. This occurs when a wider cache line (super-block), made up of multiple valid cache lines (sub-blocks), is replaced or evicted when only a sub-block is to be allocated into the wider super-block. To address this problem, we propose cache replacement policies as they relate specifically to sub-block caches. We propose new replacement policies that are tuned for sub-block caches by adding more intelligence based on the valid state of individual sub-blocks of a super-block. We also investigate the effect of using a few level-0 registers to bypass a few level-1 cache pipe stages on sub-block cache performance. To evaluate the performance improvement offered by our proposed replacement policies and the use of level-0 registers, we developed a sub-block cache simulator based on the Simplescalar toolset for single-core evaluations and the Sniper Simulator for multicore evaluations. We show that, with minimal architectural updates to existing conventional cache replacement policies, we are able to improve level-1 cache hit rates by up to 4.17% using our proposed policies alone on SPEC2006 benchmarks and up to 14% in shared level-2 caches using multicore benchmark suites: PARSEC and SPLASH2.

[1]  Chia-Lin Yang,et al.  HotSpot cache: joint temporal and spatial locality exploitation for I-cache energy reduction , 2004, Proceedings of the 2004 International Symposium on Low Power Electronics and Design (IEEE Cat. No.04TH8758).

[2]  Hao Wang,et al.  Design and Analysis of a Robust Pipelined Memory System , 2010, 2010 Proceedings IEEE INFOCOM.

[3]  M. Valero,et al.  Design and implementation of high-performance memory systems for future packet buffers , 2003, Proceedings. 36th Annual IEEE/ACM International Symposium on Microarchitecture, 2003. MICRO-36..

[4]  John S. Liptay,et al.  Structural Aspects of the System/360 Model 85 II: The Cache , 1968, IBM Syst. J..

[5]  Per Stenström,et al.  Improvement of energy-efficiency in off-chip caches by selective prefetching , 2002, Microprocess. Microsystems.

[6]  Lieven Eeckhout,et al.  Sniper: Exploring the level of abstraction for scalable and accurate parallel multi-core simulation , 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).

[7]  Alvin M. Despain,et al.  Cache design trade-offs for power and performance optimization: a case study , 1995, ISLPED '95.

[8]  Heiko Sparenberg,et al.  Introduction of eviction strategies for caching scalable media files , 2012, Seventh International Conference on Digital Information Management (ICDIM 2012).

[9]  Laxmi N. Bhuyan,et al.  A dynamic cache sub-block design to reduce false sharing , 1995, Proceedings of ICCD '95 International Conference on Computer Design. VLSI in Computers and Processors.

[10]  Aleksandar Milenkovic,et al.  Performance evaluation of cache replacement policies for the SPEC CPU2000 benchmark suite , 2004, ACM-SE 42.

[11]  Lieven Eeckhout,et al.  Sniper: scalable and accurate parallel multi-core simulation , 2012 .

[12]  Hassan Ghasemzadeh,et al.  Modified pseudo LRU replacement algorithm , 2006, 13th Annual IEEE International Symposium and Workshop on Engineering of Computer-Based Systems (ECBS'06).

[13]  Mateo Valero,et al.  Design and Implementation of High-Performance Memory Systems for Future Packet Buffers , 2003, MICRO.

[14]  Margaret Martonosi,et al.  Wattch: a framework for architectural-level power analysis and optimizations , 2000, Proceedings of 27th International Symposium on Computer Architecture (IEEE Cat. No.RS00201).

[15]  Yannis Smaragdakis,et al.  The EELRU adaptive replacement algorithm , 2003, Perform. Evaluation.

[16]  E. Sackinger,et al.  A single-chip, 1.6-billion, 16-b MAC/s multiprocessor DSP , 2000, IEEE Journal of Solid-State Circuits.

[17]  Jean-Loup Baer,et al.  Modified LRU policies for improving second-level cache behavior , 2000, Proceedings Sixth International Symposium on High-Performance Computer Architecture. HPCA-6 (Cat. No.PR00550).

[18]  Anoop Gupta,et al.  The SPLASH-2 programs: characterization and methodological considerations , 1995, ISCA.

[19]  Josep Torrellas,et al.  False Sharing ans Spatial Locality in Multiprocessor Caches , 1994, IEEE Trans. Computers.

[20]  Christian Bienia,et al.  Benchmarking modern multiprocessors , 2011 .

[21]  Aamer Jaleel,et al.  Adaptive insertion policies for high performance caching , 2007, ISCA '07.

[22]  Nihar R. Mahapatra,et al.  The processor-memory bottleneck: problems and solutions , 1999, CROS.

[23]  Kai Li,et al.  The PARSEC benchmark suite: Characterization and architectural implications , 2008, 2008 International Conference on Parallel Architectures and Compilation Techniques (PACT).

[24]  Todd M. Austin,et al.  The SimpleScalar tool set, version 2.0 , 1997, CARN.

[25]  Yuan Xie,et al.  Simple but Effective Heterogeneous Main Memory with On-Chip Memory Controller Support , 2010, 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis.

[26]  Gabriel Moruz,et al.  Outperforming LRU via competitive analysis on parametrized inputs for paging , 2012, SODA.

[27]  장훈,et al.  [서평]「Computer Organization and Design, The Hardware/Software Interface」 , 1997 .

[28]  David A. Patterson,et al.  Computer Organization and Design, Fourth Edition, Fourth Edition: The Hardware/Software Interface (The Morgan Kaufmann Series in Computer Architecture and Design) , 2008 .

[29]  Norman P. Jouppi,et al.  CACTI 2.0: An Integrated Cache Timing and Power Model , 2002 .

[30]  Yan Solihin,et al.  CHOP: Adaptive filter-based DRAM caching for CMP server platforms , 2010, HPCA - 16 2010 The Sixteenth International Symposium on High-Performance Computer Architecture.

[31]  Li Zhao,et al.  Exploring DRAM cache architectures for CMP server platforms , 2007, 2007 25th International Conference on Computer Design.

[32]  William H. Mangione-Smith,et al.  The filter cache: an energy efficient memory structure , 1997, Proceedings of 30th Annual International Symposium on Microarchitecture.