Performance Impacts of Non-blocking Caches in Out-of-order Processors

Sheng Li†, Ke Chen‡, Jay B. Brockman‡, Norman P. Jouppi†
†Hewlett-Packard Labs, {sheng.li4, norm.jouppi}@hp.com
‡University of Notre Dame, {kchen2, jbb}@nd.edu

HP Laboratories, HPL-2011-65
External Posting Date: July 06, 2011. Internal Posting Date: July 06, 2011. Approved for External Publication.
Copyright 2011 Hewlett-Packard Development Company, L.P.

Keywords: Non-blocking cache; MSHR; Out-of-order Processors

Abstract: Non-blocking caches are an effective technique for tolerating cache-miss latency. They can reduce miss-induced processor stalls by buffering the misses and continuing to serve other independent access requests. Previous research on the complexity and performance of non-blocking caches supporting non-blocking loads showed they could achieve significant performance gains in comparison to blocking caches. However, those experiments were performed with benchmarks that are now over a decade old. Furthermore, the simulated processor was a single-issue processor with unlimited run-ahead capability, a perfect branch predictor, a fixed 16-cycle memory latency, single-cycle latency for floating-point operations, and write-through, write-no-allocate caches. These assumptions are very different from today's high-performance out-of-order processors such as the Intel Nehalem. Thus, it is time to re-evaluate the performance impact of non-blocking caches on practical out-of-order processors using up-to-date benchmarks. In this study, we evaluate the impact of non-blocking data caches on practical high-performance out-of-order (OOO) processors using the latest SPEC CPU2006 benchmark suite. Simulations show that a data cache that supports hit-under-2-misses can provide a 17.76% performance gain for a typical high-performance OOO processor running the SPEC CPU2006 benchmarks in comparison to a similar machine with a blocking cache.
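The difference between a blocking cache and a hit-under-N-misses cache can be illustrated with a small trace-driven timing sketch. This is a toy model, not the simulator or methodology used in the study: the function name `run_trace`, the 10-cycle miss latency, the unbounded cache capacity, the one-access-per-cycle issue rate, and the assumption that all accesses are independent are simplifications introduced here for illustration only.

```python
def run_trace(lines, max_outstanding_misses, miss_latency=10):
    """Return the cycles needed to serve a trace of cache-line ids.

    max_outstanding_misses == 0 models a blocking cache;
    N >= 1 models hit-under-N-misses (N miss-buffer entries, MSHR-like).
    Assumptions (illustrative only): unbounded cache, fixed miss latency,
    one access issued per cycle, all accesses independent.
    """
    resident = set()   # lines already filled into the cache
    in_flight = {}     # line id -> cycle at which its fill completes
    cycle = 0

    def retire_fills(now):
        # Move every fill that has completed by `now` into the cache.
        for done in [l for l, t in in_flight.items() if t <= now]:
            del in_flight[done]
            resident.add(done)

    for line in lines:
        retire_fills(cycle)
        if line in resident or line in in_flight:
            # Hit, or a secondary miss merged into the existing entry.
            cycle += 1
            continue
        if max_outstanding_misses == 0:
            # Blocking cache: stall until the fill returns.
            cycle += miss_latency
            resident.add(line)
            continue
        if len(in_flight) >= max_outstanding_misses:
            # All miss buffers busy: stall until the earliest fill returns.
            cycle = max(cycle, min(in_flight.values()))
            retire_fills(cycle)
        in_flight[line] = cycle + miss_latency
        cycle += 1

    # Wait for any fills still outstanding at the end of the trace.
    return max([cycle, *in_flight.values()])


trace = [0, 1, 2, 3]                 # four misses to distinct lines
print(run_trace(trace, 0))           # blocking cache: 40
print(run_trace(trace, 1))           # hit-under-1-miss: 40
print(run_trace(trace, 2))           # hit-under-2-misses: 21
```

In this toy trace, a single miss buffer does not help because there are no hits to overlap with the outstanding miss, while two buffers let consecutive misses overlap. Real gains depend on the workload and the processor, which is what the study measures.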
