Hybrid analytical modeling of pending cache hits, data prefetching, and MSHRs

As the number of transistors integrated on a chip continues to increase, a growing challenge is accurately modeling performance in the early stages of processor design. Analytical models have been employed to rapidly search for higher performance designs, and can provide insights that detailed simulators may not. This paper proposes techniques to predict the impact of pending cache hits, hardware prefetching, and realistic miss status holding register (MSHR) resources on superscalar performance in the presence of long latency memory systems when employing hybrid analytical models that apply instruction trace analysis. Pending cache hits are secondary references to a cache block for which a request has already been initiated but has not yet completed. We find pending hits resulting from spatial locality and the fine-grained selection of instruction profile window blocks used for analysis both have non-negligible influences on the accuracy of hybrid analytical models and subsequently propose techniques to account for their effects. We then introduce techniques to estimate the performance impact of data prefetching by modeling the timeliness of prefetches and to account for a limited number of MSHRs by restricting the size of profile window blocks. As with earlier hybrid analytical models, our approach is roughly two orders of magnitude faster than detailed simulations. When modeling pending hits for a processor with unlimited outstanding misses we improve the accuracy of our baseline by a factor of 3.9, decreasing average error from 39.7% to 10.3%. When modeling a processor with data prefetching, a limited number of MSHRs, or both, the techniques result in an average error of 13.8%, 9.5% and 17.8%, respectively.

[1]  John L. Hennessy,et al.  Efficient performance prediction for modern microprocessors , 2000, SIGMETRICS '00.

[2]  Norman P. Jouppi,et al.  Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers , 1990, [1990] Proceedings. The 17th Annual International Symposium on Computer Architecture.

[3]  James E. Smith,et al.  Automated design of application specific superscalar processors: an analytical approach , 2007, ISCA '07.

[4]  Norman P. Jouppi,et al.  How useful are non-blocking loads, stream buffers and speculative execution in multiple issue processors? , 1995, Proceedings of 1995 1st IEEE Symposium on High Performance Computer Architecture.

[5]  John Paul Shen,et al.  Theoretical modeling of superscalar processor performance , 1994, Proceedings of MICRO-27. The 27th Annual IEEE/ACM International Symposium on Microarchitecture.

[6]  Craig B. Zilles Benchmark health considered harmful , 2001, CARN.

[7]  Josep Torrellas,et al.  Scalable Cache Miss Handling for High Memory-Level Parallelism , 2006, 2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06).

[8]  David J. Sager,et al.  The microarchitecture of the Pentium 4 processor , 2001 .

[9]  Martin C. Carlisle,et al.  Olden: parallelizing programs with dynamic data structures on distributed-memory machines , 1996 .

[10]  Stéphan Jourdan,et al.  Exploring instruction-fetch bandwidth requirement in wide-issue superscalar processors , 1999, 1999 International Conference on Parallel Architectures and Compilation Techniques (Cat. No.PR00425).

[11]  James E. Smith,et al.  A first-order superscalar processor model , 2004, Proceedings. 31st Annual International Symposium on Computer Architecture, 2004..

[12]  Todd M. Austin,et al.  The SimpleScalar tool set, version 2.0 , 1997, CARN.

[13]  John E. McDonald,et al.  Storage Hierarchy Optimization Procedure , 1975, IBM J. Res. Dev..

[14]  Mark Horowitz,et al.  An analytical cache model , 1989, TOCS.

[15]  Tor M. Aamodt,et al.  An Improved Analytical Superscalar Microprocessor Memory Model , 2008 .

[16]  Brad Calder,et al.  Automatically characterizing large scale program behavior , 2002, ASPLOS X.

[17]  Stijn Eyerman,et al.  Analytical performance analysis and modeling of superscalar and multi-threaded processors , 2008 .

[18]  Tejas Karkhanis,et al.  Automated design of application-specific superscalar processors , 2006 .

[19]  C. K. Chow,et al.  On Optimization of Storage Hierarchies , 1974, IBM J. Res. Dev..

[20]  Philippe Roussel,et al.  The microarchitecture of the intel pentium 4 processor on 90nm technology , 2004 .

[21]  David R. Kaeli,et al.  A discussion on non-blocking/lockup-free caches , 1996, CARN.

[22]  Jean-Loup Baer,et al.  An effective on-chip preloading scheme to reduce data access penalty , 1991, Proceedings of the 1991 ACM/IEEE Conference on Supercomputing (Supercomputing '91).

[23]  Thomas F. Wenisch,et al.  SMARTS: accelerating microarchitecture simulation via rigorous statistical sampling , 2003, ISCA '03.

[24]  John Paul Shen,et al.  A framework for statistical modeling of superscalar processor performance , 1997, Proceedings Third International Symposium on High-Performance Computer Architecture.

[25]  Mateo Valero,et al.  Toward kilo-instruction processors , 2004, TACO.

[26]  Stéphan Jourdan,et al.  An Exploration of Instruction Fetch Requirement in Out-of-Order Superscalar Processors , 2004, International Journal of Parallel Programming.

[27]  K. Kavi Cache Memories Cache Memories in Uniprocessors. Reading versus Writing. Improving Performance , 2022 .

[28]  James E. Smith,et al.  A performance counter architecture for computing accurate CPI components , 2006, ASPLOS XII.

[29]  David I. August,et al.  Microarchitectural exploration with Liberty , 2002, MICRO 35.

[30]  James E. Smith,et al.  The future of simulation: a field of dreams , 2006, Computer.

[31]  Trevor N. Mudge,et al.  An Analytical Model for Designing Memory Hierarchies , 1996, IEEE Trans. Computers.

[32]  David Kroft,et al.  Lockup-free instruction fetch/prefetch cache organization , 1998, ISCA '81.