Producing Reliable Full-System Simulation Results: A Case Study of CMP with Very Large Caches

The greater detail and improved realism of full-system architecture simulation make it a valuable computer architecture design tool. However, its unique characteristics introduce new sources of simulation variability, which can make the results of such simulations less reliable. Meanwhile, demand for larger caches and more levels of cache has grown as designers seek to improve system power and performance. This paper presents techniques for producing reliable results in full-system simulation of CMP computer systems with large caches. Specifically, we propose the detailed emulation replay warmup technique to deal with cold or incompletely warmed-up large caches. We also propose the region-of-interest synchronization technique to avoid simulating non-representative phases when running multi-program workloads. Furthermore, we quantify the variation reduction that processor affinity and checkpointing can achieve. Finally, we show that applying all four of these techniques limits simulation variability to less than 1%, making the simulation results more reliable.
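The abstract does not state how variability is measured. As a minimal sketch, assuming it is expressed as the coefficient of variation (sample standard deviation divided by the mean) of a metric across repeated runs of the same configuration, the check against a 1% bound could look like the following. The function name and the cycle counts are hypothetical and are not taken from the paper.

```python
from statistics import mean, stdev


def coefficient_of_variation(samples):
    """Return the sample standard deviation over the mean, as a percentage.

    Assumption: this is one common way to express run-to-run variability;
    the paper's exact metric is not given in the abstract.
    """
    if len(samples) < 2:
        raise ValueError("need at least two runs to estimate variability")
    m = mean(samples)
    if m == 0:
        raise ValueError("mean is zero; coefficient of variation is undefined")
    return 100.0 * stdev(samples) / m


# Hypothetical cycle counts from five runs of one simulated configuration.
runs = [1000.0, 1010.0, 990.0, 1005.0, 995.0]
cov = coefficient_of_variation(runs)
print(f"variability: {cov:.2f}%")
print("within 1% bound" if cov < 1.0 else "exceeds 1% bound")
```

Using the sample (n-1) standard deviation is a deliberate choice here, since a handful of simulation runs is a small sample of the possible executions.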
