Accurately modeling superscalar processor performance with reduced trace

Trace-driven simulation of out-of-order superscalar processors is far from straightforward. Because such processors schedule instructions dynamically while a trace is static, results can be highly inaccurate when the trace has been reduced to only a subset of the executed instructions. In this paper, we describe and comprehensively evaluate the pairwise dependent cache miss model (PDCM), a framework for fast and accurate trace-driven simulation of out-of-order superscalar processors. The model determines how to treat each cache miss with respect to the other cache misses recorded in the trace by dynamically reconstructing the reorder buffer state during simulation and honoring the dependencies between trace items. Our experimental results demonstrate that, relative to an execution-driven simulator, a PDCM-based simulator produces highly accurate results (less than 3% error) at high simulation speed (a 62.5x speedup on average). Moreover, we observe that the proposed simulation method preserves a processor's dynamic off-core memory access behavior and accurately predicts the relative performance change when the processor's low-level memory hierarchy parameters are changed.
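To make the idea of reconstructing reorder buffer state and honoring dependencies between trace items more concrete, the following is a minimal sketch and not the PDCM implementation evaluated in the paper. It assumes a hypothetical trace format in which each item records only a cache miss, its position in the dynamic instruction stream, and an optional earlier miss it depends on. The names `TraceItem` and `simulate`, and the parameters `rob_size`, `miss_latency`, and `width`, are illustrative assumptions. Two misses overlap only if they are independent and close enough to share the reorder buffer; otherwise they are serialized.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TraceItem:
    """One recorded cache miss in a reduced trace (hypothetical format)."""
    inst_index: int                   # position in the dynamic instruction stream
    depends_on: Optional[int] = None  # trace position of an EARLIER miss it needs


def simulate(trace: List[TraceItem], rob_size: int = 64,
             miss_latency: int = 200, width: int = 4) -> int:
    """Return the cycle at which the last miss in the trace completes."""
    complete: List[int] = []  # completion cycle of each miss processed so far
    prev_dispatch = 0
    prev_index = 0
    for item in trace:
        # In-order dispatch: instructions between two misses need
        # (distance // width) cycles to pass through the front end.
        dispatch = prev_dispatch + (item.inst_index - prev_index) // width
        # Reorder buffer constraint: a miss that is rob_size or more
        # instructions older must complete before this one can enter.
        for older, done in zip(trace, complete):
            if item.inst_index - older.inst_index >= rob_size:
                dispatch = max(dispatch, done)
        # Dependence constraint: wait for the producer miss to return data.
        issue = dispatch
        if item.depends_on is not None:
            issue = max(issue, complete[item.depends_on])
        complete.append(issue + miss_latency)
        prev_dispatch, prev_index = dispatch, item.inst_index
    return max(complete, default=0)


if __name__ == "__main__":
    independent = [TraceItem(0), TraceItem(8)]
    dependent = [TraceItem(0), TraceItem(8, depends_on=0)]
    far_apart = [TraceItem(0), TraceItem(128)]
    print(simulate(independent))  # 202: the two misses overlap
    print(simulate(dependent))    # 400: the second miss waits for the first
    print(simulate(far_apart))    # 400: the second miss does not fit in the ROB
```

Under these assumptions, the independent pair finishes in roughly one miss latency, while the dependent pair and the pair separated by more than `rob_size` instructions each take about two. This is the kind of pairwise distinction a reduced trace cannot express unless reorder buffer occupancy and dependencies are rebuilt during simulation.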
