Two-phase trace-driven simulation (TPTS): a fast multicore processor architecture simulation approach

Simulation is indispensable in computer architecture research. Researchers increasingly resort to detailed architecture simulators to identify performance bottlenecks, analyze interactions among different hardware and software components, and measure the impact of new design ideas on the system performance. However, the slow speed of conventional execution-driven architecture simulators is a serious impediment to obtaining desirable research productivity. This paper describes a novel fast multicore processor architecture simulation framework called Two-Phase Trace-driven Simulation (TPTS), which splits detailed timing simulation into a trace generation phase and a trace simulation phase. Much of the simulation overhead caused by uninteresting architectural events is only incurred once during the cycle-accurate simulation-based trace generation phase and can be omitted in the repeated trace-driven simulations. We report our experiences with tsim, an event-driven multicore processor architecture simulator that models detailed memory hierarchy, interconnect, and coherence protocol based on the TPTS framework. By applying aggressive event filtering, tsim achieves an impressive simulation speed of 146 millions of simulated instructions per second, when running 16-thread parallel applications. Copyright © 2010 John Wiley & Sons, Ltd.

[1]  Mikko H. Lipasti,et al.  Can trace-driven simulators accurately predict superscalar performance? , 1996, Proceedings International Conference on Computer Design. VLSI in Computers and Processors.

[2]  Yannis Smaragdakis,et al.  Flexible reference trace reduction for VM simulations , 2003, TOMC.

[3]  James R. Larus,et al.  Wisconsin Wind Tunnel II: a fast, portable parallel architecture simulator , 2000, IEEE Concurr..

[4]  Susan J. Eggers,et al.  An analysis of database workload performance on simultaneous multithreaded processors , 1998, ISCA.

[5]  David I. August,et al.  Exploiting parallelism and structure to accelerate the simulation of chip multi-processors , 2006, The Twelfth International Symposium on High-Performance Computer Architecture, 2006..

[6]  Kunle Olukotun,et al.  Niagara: a 32-way multithreaded Sparc processor , 2005, IEEE Micro.

[7]  Fredrik Larsson,et al.  Simics: A Full System Simulation Platform , 2002, Computer.

[8]  Sangyeun Cho,et al.  Accurately approximating superscalar processor performance from traces , 2009, 2009 IEEE International Symposium on Performance Analysis of Systems and Software.

[9]  Anoop Gupta,et al.  The SPLASH-2 programs: characterization and methodological considerations , 1995, ISCA.

[10]  Guilherme Ottoni,et al.  Global Multi-Threaded Instruction Scheduling , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).

[11]  Sangyeun Cho,et al.  Managing Distributed, Shared L2 Caches through OS-Level Page Allocation , 2006, 2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06).

[12]  Alan D. George,et al.  Parallel simulation of chip-multiprocessor architectures , 2002, TOMC.

[13]  Leslie A. Barnes Performance Modeling and Analysis for AMD's High Performance Microprocessors , 2007, ISPASS.

[14]  Thomas F. Wenisch,et al.  An Evaluation of Stratified Sampling of Microarchitecture Simulations , 2004 .

[15]  William J. Dally Interconnect-Centric Computing , 2007, 2007 IEEE 13th International Symposium on High Performance Computer Architecture.

[16]  Saurabh Dighe,et al.  An 80-Tile 1.28TFLOPS Network-on-Chip in 65nm CMOS , 2007, 2007 IEEE International Solid-State Circuits Conference. Digest of Technical Papers.

[17]  David J. Lilja,et al.  Measuring computer performance : A practitioner's guide , 2000 .

[18]  S. Kim,et al.  Fair cache sharing and partitioning in a chip multiprocessor architecture , 2004, Proceedings. 13th International Conference on Parallel Architecture and Compilation Techniques, 2004. PACT 2004..

[19]  Paolo Faraboschi,et al.  An Adaptive Synchronization Technique for Parallel Simulation of Networked Clusters , 2008, ISPASS 2008 - IEEE International Symposium on Performance Analysis of Systems and software.

[20]  Babak Falsafi,et al.  A complexity-effective architecture for accelerating full-system multiprocessor simulations using FPGAs , 2008, FPGA '08.

[21]  Per Stenström,et al.  Enhancing Multiprocessor Architecture Simulation Speed Using Matched-Pair Comparison , 2005, IEEE International Symposium on Performance Analysis of Systems and Software, 2005. ISPASS 2005..

[22]  Krste Asanovic,et al.  Accelerating Multiprocessor Simulation with a Memory Timestamp Record , 2005, IEEE International Symposium on Performance Analysis of Systems and Software, 2005. ISPASS 2005..

[23]  Robert J. Fowler,et al.  MINT: a front end for efficient simulation of shared-memory multiprocessors , 1994, Proceedings of International Workshop on Modeling, Analysis and Simulation of Computer and Telecommunication Systems.

[24]  Brad Calder,et al.  Automatically characterizing large scale program behavior , 2002, ASPLOS X.

[25]  Michel Dubois,et al.  Cache inclusion and processor sampling in multiprocessor simulations , 1993, SIGMETRICS '93.

[26]  James E. Smith,et al.  A first-order superscalar processor model , 2004, Proceedings. 31st Annual International Symposium on Computer Architecture, 2004..

[27]  Dam Sunwoo,et al.  FPGA-Accelerated Simulation Technologies (FAST): Fast, Full-System, Cycle-Accurate Simulators , 2007, MICRO.

[28]  Jichuan Chang,et al.  Cooperative Caching for Chip Multiprocessors , 2006, 33rd International Symposium on Computer Architecture (ISCA'06).

[29]  Stephen R. Goldschmidt,et al.  Simulation of multiprocessors: accuracy and performance , 1993 .

[30]  Ravi R. Iyer,et al.  CQoS: a framework for enabling QoS in shared caches of CMP platforms , 2004, ICS '04.

[31]  Yan Solihin,et al.  A Framework for Providing Quality of Service in Chip Multi-Processors , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).

[32]  Michael Zhang,et al.  Victim Replication: Maximizing Capacity while Hiding Wire Delay in Tiled Chip Multiprocessors , 2005, ISCA 2005.

[33]  Tao Li,et al.  Guest Editors' Introduction: Interaction of Many-Core Computer Architecture and Operating Systems , 2008, IEEE Micro.

[34]  Mark D. Hill,et al.  Lamport clocks: verifying a directory cache-coherence protocol , 1998, SPAA '98.

[35]  Todd M. Austin,et al.  SimpleScalar: An Infrastructure for Computer System Modeling , 2002, Computer.

[36]  Thomas F. Wenisch,et al.  Simulation sampling with live-points , 2006, 2006 IEEE International Symposium on Performance Analysis of Systems and Software.

[37]  Trevor N. Mudge,et al.  Trace-driven memory simulation: a survey , 1997, CSUR.

[38]  James E. Smith,et al.  The future of simulation: a field of dreams , 2006, Computer.

[39]  Anant Agarwal,et al.  Blocking: exploiting spatial locality for trace compaction , 1990, SIGMETRICS '90.

[40]  Thomas F. Wenisch,et al.  SMARTS: accelerating microarchitecture simulation via rigorous statistical sampling , 2003, ISCA '03.

[41]  James E. Smith,et al.  Fair Queuing Memory Systems , 2006, 2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06).

[42]  Jung Ho Ahn,et al.  How to simulate 1000 cores , 2009, CARN.

[43]  Milo M. K. Martin,et al.  Multifacet's general execution-driven multiprocessor simulator (GEMS) toolset , 2005, CARN.

[44]  Jianwei Chen,et al.  SlackSim: a platform for parallel simulations of CMPs on CMPs , 2009, CARN.