Full Speed Ahead: Detailed Architectural Simulation at Near-Native Speed

Cycle-level micro architectural simulation is the de-facto standard to estimate performance of next-generation platforms. Unfortunately, the level of detail needed for accurate simulation requires complex, and therefore slow, simulation models that run at speeds that are thousands of times slower than native execution. With the introduction of sampled simulation, it has become possible to simulate only the key, representative portions of a workload in a reasonable amount of time and reliably estimate its overall performance. These sampling methodologies provide the ability to identify regions for detailed execution, and through micro architectural state check pointing, one can quickly and easily determine the performance characteristics of a workload for a variety of micro architectural changes. While this strategy of sampling simulations to generate checkpoints performs well for static applications, more complex scenarios involving hardware-software co-design (such as co-optimizing both a Java virtual machine and the micro architecture it is running on) cause this methodology to break down, as new micro architectural checkpoints are needed for each memory hierarchy configuration and software version. Solutions are therefore needed to enable fast and accurate simulation that also address the needs of hardware-software co-design and exploration. In this work we present a methodology to enhance checkpoint-based sampled simulation. Our solution integrates hardware virtualization to provide near-native speed, virtualized fast-forwarding to regions of interest, and parallel detailed simulation. However, as we cannot warm the simulated caches during virtualized fast-forwarding, we develop a novel approach to estimate the error introduced by limited cache warming, through the use of optimistic and pessimistic warming simulations. Using virtualized fast-forwarding (which operates at 90% of native speed on average), we demonstrate a parallel sampling simulator that can be used to accurately estimate the IPC of standard workloads with an average error of 2.2% while still reaching an execution rate of 2.0 GIPS (63% of native) on average. Additionally, we demonstrate that our parallelization strategy scales almost linearly and simulates one core at up to 93% of its native execution rate, 19,000x faster than detailed simulation, while using 8 cores.

[1]  Brad Calder,et al.  Automatically characterizing large scale program behavior , 2002, ASPLOS X.

[2]  Kim M. Hazelwood,et al.  SuperPin: Parallelizing Dynamic Instrumentation for Real-Time Performance , 2007, International Symposium on Code Generation and Optimization (CGO'07).

[3]  Thomas F. Wenisch,et al.  SimFlex: Statistical Sampling of Computer System Simulation , 2006, IEEE Micro.

[4]  Anoop Gupta,et al.  Complete computer system simulation: the SimOS approach , 1995, IEEE Parallel Distributed Technol. Syst. Appl..

[5]  Lieven Eeckhout,et al.  Efficient Sampling Startup for SimPoint , 2006, IEEE Micro.

[6]  Paolo Faraboschi,et al.  Combining Simulation and Virtualization through Dynamic Sampling , 2007, 2007 IEEE International Symposium on Performance Analysis of Systems & Software.

[7]  Dirk Grunwald,et al.  Shadow Profiling: Hiding Instrumentation Costs with Parallelism , 2007, International Symposium on Code Generation and Optimization (CGO'07).

[8]  Shunfei Chen,et al.  MARSS: A full system simulator for multicore x86 CPUs , 2011, 2011 48th ACM/EDAC/IEEE Design Automation Conference (DAC).

[9]  James R. Larus,et al.  The Wisconsin Wind Tunnel: virtual prototyping of parallel computers , 1993, SIGMETRICS '93.

[10]  Milo M. K. Martin,et al.  Multifacet's general execution-driven multiprocessor simulator (GEMS) toolset , 2005, CARN.

[11]  Lieven Eeckhout,et al.  Self-monitored adaptive cache warm-up for microprocessor simulation , 2004, 16th Symposium on Computer Architecture and High Performance Computing.

[12]  Babak Falsafi,et al.  ProtoFlex: Towards Scalable, Full-System Multiprocessor Simulations Using FPGAs , 2009, TRETS.

[13]  Mohamed F. Mokbel,et al.  SARD: A statistical approach for ranking database tuning parameters , 2008, 2008 IEEE 24th International Conference on Data Engineering Workshop.

[14]  George Kurian,et al.  Graphite: A distributed parallel simulator for multicores , 2010, HPCA - 16 2010 The Sixteenth International Symposium on High-Performance Computer Architecture.

[15]  Shelley Chen Direct SMARTS : Accelerating Microarchitectural Simulation Through Direct Execution , 2004 .

[16]  Paolo Faraboschi,et al.  COTSon: infrastructure for full system simulation , 2009, OPSR.

[17]  Roland E. Wunderlich,et al.  SMARTS: accelerating microarchitecture simulation via rigorous statistical sampling , 2003, 30th Annual International Symposium on Computer Architecture, 2003. Proceedings..

[18]  Somayeh Sardashti,et al.  The gem5 simulator , 2011, CARN.

[19]  R. M. Fujimoto,et al.  Parallel discrete event simulation , 1989, WSC '89.

[20]  Christoforos E. Kozyrakis,et al.  ZSim: fast and accurate microarchitectural simulation of thousand-core systems , 2013, ISCA.

[21]  Lieven Eeckhout,et al.  VSim: Simulating multi-server setups at near native hardware speed , 2012, TACO.

[22]  Thomas F. Wenisch,et al.  TurboSMARTS: accurate microarchitecture simulation sampling in minutes , 2005, SIGMETRICS '05.

[23]  Matt T. Yourst PTLsim: A Cycle Accurate Full System x86-64 Microarchitectural Simulator , 2007, 2007 IEEE International Symposium on Performance Analysis of Systems & Software.

[24]  Harish Patil,et al.  Pin: building customized program analysis tools with dynamic instrumentation , 2005, PLDI '05.

[25]  A. Kivity,et al.  kvm : the Linux Virtual Machine Monitor , 2007 .