Modeling Out-of-Order Superscalar Processor Performance Quickly and Accurately with Traces

Fast and accurate processor simulation is essential in processor design. Trace-driven simulation is a widely practiced fast simulation method. However, serious accuracy issues arise when an out-of-order superscalar processor is considered. In this thesis, trace-driven simulation methods are suggested to quickly and accurately model out-of-order superscalar processor performance with reduced traces. The approaches abstract the processor core and focus on the processor's uncore events rather than the processor's internal events. As a result, fast simulation speed is achieved while maintaining fairly small error compared with an execution-driven simulator. Traces can be generated either by a cycle-accurate simulator or an abstract timing model on top of a simple functional simulator. Simulation results are more accurate with the method using traces generated from a cycle-accurate simulator. Faster trace generation speed is achieved with the abstract timing model. The methods determine how to treat a cache miss with respect to other cache misses recorded in the trace by dynamically reconstructing the reorder buffer state during simulation and honoring the dependencies between the trace items. This approach preserves a processor's dynamic uncore access patterns and accurately predicts the relative performance change when the processor's uncore-level parameters are changed. The methods are attractive especially in the early design stages due to its fast simulation speed.

[1]  Mikko H. Lipasti,et al.  Can trace-driven simulators accurately predict superscalar performance? , 1996, Proceedings International Conference on Computer Design. VLSI in Computers and Processors.

[2]  Yannis Smaragdakis,et al.  Flexible reference trace reduction for VM simulations , 2003, TOMC.

[3]  David B. Papworth Tuning the Pentium Pro microarchitecture , 1996, IEEE Micro.

[4]  R.H. Katz,et al.  A characterization of sharing in parallel programs and its application to coherency protocol evaluation , 1988, [1988] The 15th Annual International Symposium on Computer Architecture. Conference Proceedings.

[5]  Somayeh Sardashti,et al.  The gem5 simulator , 2011, CARN.

[6]  Trevor N. Mudge,et al.  Trace-driven memory simulation: a survey , 1997, CSUR.

[7]  Fredrik Larsson,et al.  Simics: A Full System Simulation Platform , 2002, Computer.

[8]  Sangyeun Cho,et al.  Accurately approximating superscalar processor performance from traces , 2009, 2009 IEEE International Symposium on Performance Analysis of Systems and Software.

[9]  Anoop Gupta,et al.  The SPLASH-2 programs: characterization and methodological considerations , 1995, ISCA.

[10]  Pradeep Dubey,et al.  Platform 2015: Intel ® Processor and Platform Evolution for the Next Decade , 2005 .

[11]  Lieven Eeckhout,et al.  Memory Data Flow Modeling in Statistical Simulation for the Efficient Exploration of Microprocessor Design Spaces , 2008, IEEE Transactions on Computers.

[12]  Shunfei Chen,et al.  MARSS: A full system simulator for multicore x86 CPUs , 2011, 2011 48th ACM/EDAC/IEEE Design Automation Conference (DAC).

[13]  Kevin Skadron,et al.  CMP design space exploration subject to physical constraints , 2006, The Twelfth International Symposium on High-Performance Computer Architecture, 2006..

[14]  Susan J. Eggers,et al.  Techniques for efficient inline tracing on a shared-memory multiprocessor , 1990, SIGMETRICS '90.

[15]  James R. Larus,et al.  Wisconsin Wind Tunnel II: a fast, portable parallel architecture simulator , 2000, IEEE Concurr..

[16]  David J. Lilja,et al.  Measuring computer performance : A practitioner's guide , 2000 .

[17]  Per Stenström,et al.  Enhancing Multiprocessor Architecture Simulation Speed Using Matched-Pair Comparison , 2005, IEEE International Symposium on Performance Analysis of Systems and Software, 2005. ISPASS 2005..

[18]  Gabriel H. Loh A time-stamping algorithm for efficient performance estimation of superscalar processors , 2001, SIGMETRICS '01.

[19]  Onur Mutlu,et al.  Feedback Directed Prefetching: Improving the Performance and Bandwidth-Efficiency of Hardware Prefetchers , 2007, 2007 IEEE 13th International Symposium on High Performance Computer Architecture.

[20]  Rastislav Bodík,et al.  Focusing processor policies via critical-path prediction , 2001, Proceedings 28th Annual International Symposium on Computer Architecture.

[21]  Maurice V. Wilkes,et al.  The memory wall and the CMOS end-point , 1995, CARN.

[22]  Mike Johnson,et al.  Superscalar microprocessor design , 1991, Prentice Hall series in innovative technology.

[23]  Eric Rotenberg,et al.  Trace cache: a low latency approach to high bandwidth instruction fetching , 1996, Proceedings of the 29th Annual IEEE/ACM International Symposium on Microarchitecture. MICRO 29.

[24]  Milo M. K. Martin,et al.  Multifacet's general execution-driven multiprocessor simulator (GEMS) toolset , 2005, CARN.

[25]  Sangyeun Cho,et al.  In-N-Out: Reproducing Out-of-Order Superscalar Processor Behavior from Reduced In-Order Traces , 2011, 2011 IEEE 19th Annual International Symposium on Modelling, Analysis, and Simulation of Computer and Telecommunication Systems.

[26]  Fabrice Bellard,et al.  QEMU, a Fast and Portable Dynamic Translator , 2005, USENIX Annual Technical Conference, FREENIX Track.

[27]  Harish Patil,et al.  Pin: building customized program analysis tools with dynamic instrumentation , 2005, PLDI '05.

[28]  Sally A. McKee,et al.  Hitting the memory wall: implications of the obvious , 1995, CARN.

[29]  A. J. KleinOsowski,et al.  MinneSPEC: A New SPEC Benchmark Workload for Simulation-Based Computer Architecture Research , 2002, IEEE Computer Architecture Letters.

[30]  Philip Bitar,et al.  A Critique of Trace-Driven Simulation for Shared-Memory Multiprocessors , 1990 .

[31]  John L. Hennessy,et al.  Efficient performance prediction for modern microprocessors , 2000, SIGMETRICS '00.

[32]  Rami G. Melhem,et al.  Scalable Multi-cache Simulation Using GPUs , 2011, 2011 IEEE 19th Annual International Symposium on Modelling, Analysis, and Simulation of Computer and Telecommunication Systems.

[33]  Roland E. Wunderlich,et al.  SMARTS: accelerating microarchitecture simulation via rigorous statistical sampling , 2003, 30th Annual International Symposium on Computer Architecture, 2003. Proceedings..

[34]  Brian Fahs,et al.  Microarchitecture optimizations for exploiting memory-level parallelism , 2004, Proceedings. 31st Annual International Symposium on Computer Architecture, 2004..

[35]  Lizy Kurian John,et al.  Synthesizing memory-level parallelism aware miniature clones for SPEC CPU2006 and ImplantBench workloads , 2010, 2010 IEEE International Symposium on Performance Analysis of Systems & Software (ISPASS).

[36]  James E. Smith,et al.  The future of simulation: a field of dreams , 2006, Computer.

[37]  Nicholas Nethercote,et al.  Valgrind: a framework for heavyweight dynamic binary instrumentation , 2007, PLDI '07.

[38]  James E. Smith,et al.  Statistical Simulation: Adding Efficiency to the Computer Designer's Toolbox , 2003, IEEE Micro.

[39]  John Paul Shen,et al.  A framework for statistical modeling of superscalar processor performance , 1997, Proceedings Third International Symposium on High-Performance Computer Architecture.

[40]  Mikko H. Lipasti,et al.  Modern Processor Design: Fundamentals of Superscalar Processors , 2002 .

[41]  Todd M. Austin,et al.  SimpleScalar: An Infrastructure for Computer System Modeling , 2002, Computer.

[42]  James E. Smith,et al.  Statistical simulation of symmetric multiprocessor systems , 2002, Proceedings 35th Annual Simulation Symposium. SS 2002.

[43]  Rastislav Bodík,et al.  Interaction cost and shotgun profiling , 2004, TACO.

[44]  Thomas F. Wenisch,et al.  Simulation sampling with live-points , 2006, 2006 IEEE International Symposium on Performance Analysis of Systems and Software.

[45]  Thomas F. Wenisch,et al.  SimFlex: Statistical Sampling of Computer System Simulation , 2006, IEEE Micro.

[46]  David A. Patterson,et al.  Computer Architecture: A Quantitative Approach , 1969 .

[47]  Leslie A. Barnes Performance Modeling and Analysis for AMD's High Performance Microprocessors , 2007, ISPASS.

[48]  Lieven Eeckhout,et al.  Measuring benchmark similarity using inherent program characteristics , 2006, IEEE Transactions on Computers.

[49]  Stéphan Jourdan,et al.  An Exploration of Instruction Fetch Requirement in Out-of-Order Superscalar Processors , 2004, International Journal of Parallel Programming.

[50]  Tor M. Aamodt,et al.  Hybrid analytical modeling of pending cache hits, data prefetching, and MSHRs , 2008, 2008 41st IEEE/ACM International Symposium on Microarchitecture.

[51]  Thomas F. Wenisch,et al.  An Evaluation of Stratified Sampling of Microarchitecture Simulations , 2004 .

[52]  Burzin A. Patel,et al.  Optimization of instruction fetch mechanisms for high issue rates , 1995, Proceedings 22nd Annual International Symposium on Computer Architecture.

[53]  James E. Smith,et al.  A first-order superscalar processor model , 2004, Proceedings. 31st Annual International Symposium on Computer Architecture, 2004..

[54]  James E. Smith,et al.  Modeling superscalar processors via statistical simulation , 2001, Proceedings 2001 International Conference on Parallel Architectures and Compilation Techniques.

[55]  James R. Larus,et al.  Efficient path profiling , 1996, Proceedings of the 29th Annual IEEE/ACM International Symposium on Microarchitecture. MICRO 29.

[56]  George Kurian,et al.  Graphite: A distributed parallel simulator for multicores , 2010, HPCA - 16 2010 The Sixteenth International Symposium on High-Performance Computer Architecture.

[57]  Kunle Olukotun,et al.  Niagara: a 32-way multithreaded Sparc processor , 2005, IEEE Micro.

[58]  Sangyeun Cho,et al.  Accurately modeling superscalar processor performance with reduced trace , 2013, J. Parallel Distributed Comput..

[59]  Mayan Moudgill,et al.  Environment for PowerPC microarchitecture exploration , 1999, IEEE Micro.

[60]  John L. Hennessy,et al.  The accuracy of trace-driven simulations of multiprocessors , 1993, SIGMETRICS '93.

[61]  Anant Agarwal,et al.  Blocking: exploiting spatial locality for trace compaction , 1990, SIGMETRICS '90.

[62]  Laszlo A. Belady,et al.  A Study of Replacement Algorithms for Virtual-Storage Computer , 1966, IBM Syst. J..

[63]  Louise Trevillyan,et al.  Representative traces for processor models with infinite cache , 1996, Proceedings. Second International Symposium on High-Performance Computer Architecture.

[64]  Hyunjin Lee,et al.  Two‐phase trace‐driven simulation (TPTS): a fast multicore processor architecture simulation approach , 2010, Softw. Pract. Exp..

[65]  Brad Calder,et al.  Automatically characterizing large scale program behavior , 2002, ASPLOS X.

[66]  Kai Li,et al.  The PARSEC benchmark suite: Characterization and architectural implications , 2008, 2008 International Conference on Parallel Architectures and Compilation Techniques (PACT).

[67]  Michel Dubois,et al.  Cache inclusion and processor sampling in multiprocessor simulations , 1993, SIGMETRICS '93.

[68]  Pascal Sainrat,et al.  Multiple-block ahead branch predictors , 1996, ASPLOS VII.

[69]  James R. Larus,et al.  Efficient program tracing , 1993, Computer.

[70]  Stijn Eyerman,et al.  Interval simulation: Raising the level of abstraction in architectural simulation , 2010, HPCA - 16 2010 The Sixteenth International Symposium on High-Performance Computer Architecture.

[71]  Sangyeun Cho I-cache multi-banking and vertical interleaving , 2007, GLSVLSI '07.

[72]  Lesley Anne Polka Package Technology to Address the Memory Bandwidth Challenge for Terascale Computing , 2007 .

[73]  James E. Smith,et al.  Advanced Micro Devices , 2005 .

[74]  Wen-Hann Wang,et al.  Efficient trace-driven simulation methods for cache performance analysis , 1991, TOCS.

[75]  K. Kavi Cache Memories Cache Memories in Uniprocessors. Reading versus Writing. Improving Performance , 2022 .

[76]  Hyunjin Lee,et al.  CloudCache: Expanding and shrinking private caches , 2011, 2011 IEEE 17th International Symposium on High Performance Computer Architecture.