Performance Impacts of Non-blocking Caches in Out-of-order Processors

Sheng Li, Ke Chen, Jay B. Brockman, Norman P. Jouppi
† Hewlett-Packard Labs, {sheng.li4, norm.jouppi}@hp.com
‡ University of Notre Dame, {kchen2, jbb}@nd.edu

HP Laboratories, HPL-2011-65
External Posting Date: July 06, 2011
Internal Posting Date: July 06, 2011
Approved for External Publication
Copyright 2011 Hewlett-Packard Development Company, L.P.

Keywords: Non-blocking cache; MSHR; Out-of-order Processors

Abstract: Non-blocking caches are an effective technique for tolerating cache-miss latency. They can reduce miss-induced processor stalls by buffering the misses and continuing to serve other independent access requests. Previous research on the complexity and performance of non-blocking caches supporting non-blocking loads showed they could achieve significant performance gains in comparison to blocking caches. However, those experiments were performed with benchmarks that are now over a decade old. Furthermore, the simulated processor was a single-issue processor with unlimited run-ahead capability, a perfect branch predictor, a fixed 16-cycle memory latency, single-cycle latency for floating-point operations, and write-through, write-no-allocate caches. These assumptions are very different from today's high-performance out-of-order processors such as the Intel Nehalem. Thus, it is time to re-evaluate the performance impact of non-blocking caches on practical out-of-order processors using up-to-date benchmarks. In this study, we evaluate the impact of non-blocking data caches using the latest SPEC CPU2006 benchmark suite on practical high-performance out-of-order (OOO) processors. Simulations show that a data cache that supports hit-under-2-misses can provide a 17.76% performance gain for a typical high-performance OOO processor running the SPEC CPU2006 benchmarks in comparison to a similar machine with a blocking cache.
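To make the terms "blocking cache" and "hit-under-N-misses" concrete, the following is a minimal toy timing model. It is not the simulation methodology of this study. It assumes a single-issue, in-order stream of mutually independent accesses, a fixed miss latency, and a one-cycle hit. The function name `total_cycles` and its parameters are hypothetical, chosen only for illustration.

```python
import heapq


def total_cycles(trace, max_outstanding, miss_latency=100):
    """Toy model: cycles needed to complete a trace of cache accesses.

    trace           -- iterable of bools, True means the access misses
    max_outstanding -- misses the cache can buffer (0 = blocking cache,
                       N = hit-under-N-misses); an illustrative parameter
    miss_latency    -- fixed cycles to service a miss (an assumption)
    """
    now = 0
    pending = []  # min-heap of completion times of outstanding misses
    for is_miss in trace:
        # Retire buffered misses whose data has already returned.
        while pending and pending[0] <= now:
            heapq.heappop(pending)
        if is_miss:
            if max_outstanding == 0:
                # Blocking cache: the pipeline stalls for the whole miss.
                now += miss_latency
                continue
            if len(pending) == max_outstanding:
                # All miss buffers busy: stall until the oldest one returns.
                now = heapq.heappop(pending)
            heapq.heappush(pending, now + miss_latency)
        now += 1  # one access issues per cycle (hits, and buffered misses)
    # Execution ends when the last outstanding miss has returned.
    return max([now] + pending)


if __name__ == "__main__":
    trace = [True, False, False, True, False]  # miss, hit, hit, miss, hit
    for n in (0, 1, 2):
        print(n, total_cycles(trace, n, miss_latency=10))
```

With a 10-cycle miss latency, this five-access trace takes 23 cycles with a blocking cache, 20 cycles with hit-under-1-miss, and 13 cycles with hit-under-2-misses. The model ignores data dependences, out-of-order issue, and bandwidth limits, so it shows only the direction of the effect, not the magnitudes reported in the abstract.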