Scalable Multi-cache Simulation Using GPUs

Software simulation is the primary tool for evaluating processor designs. It offers better accuracy than analytical models and is an important evaluation step before a chip is fabricated. Unfortunately, simulators are slow: a conventional cycle-accurate simulator cannot keep up with the increasing core counts of modern processor designs. Parallel simulation is one method for improving simulation speed. Two major areas of parallel simulation research are multithreaded simulators and FPGAs used as simulation accelerators. Multithreaded simulators can extract only coarse-grained parallelism and must sacrifice accuracy in order to scale well. FPGA-based simulators can extract fine-grained parallelism, but they are expensive and difficult to program. We propose using GPUs for architectural simulation. GPUs can exploit a high degree of fine-grained parallelism, and they are inexpensive and easier to program than FPGAs. To demonstrate our ideas, we implement a trace-driven many-cache simulator using NVIDIA's CUDA toolkit. Compared to serial CPU-only simulation, GPU-accelerated cache simulation scales remarkably well with the number of simulated caches.
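To make the idea of trace-driven many-cache simulation concrete, the following is a minimal serial sketch in Python, not the CUDA implementation described above. It assumes direct-mapped caches and a single address trace replayed through every cache; the abstract does not state the cache organization or how work is divided among GPU threads, so `DirectMappedCache`, `simulate`, and all parameters here are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class DirectMappedCache:
    """Hypothetical direct-mapped cache model; parameters are illustrative."""
    num_sets: int
    block_bytes: int = 64
    tags: list = field(default_factory=list)
    hits: int = 0
    misses: int = 0

    def __post_init__(self):
        # None marks an invalid (empty) line.
        self.tags = [None] * self.num_sets

    def access(self, address: int) -> bool:
        """Look up one address; return True on a hit, False on a miss."""
        block = address // self.block_bytes
        index = block % self.num_sets
        tag = block // self.num_sets
        if self.tags[index] == tag:
            self.hits += 1
            return True
        # Miss: install the new block, evicting whatever was there.
        self.tags[index] = tag
        self.misses += 1
        return False


def simulate(trace, caches):
    """Replay one address trace through every cache.

    Each cache's state is independent of the others, so iterations of
    the outer loop do not interact. That independence is the property a
    parallel implementation could exploit (an assumption of this sketch).
    """
    for cache in caches:
        for address in trace:
            cache.access(address)
    return [(c.hits, c.misses) for c in caches]


# Byte addresses touching blocks 0, 1, 0, 2, 1, 0 with 64-byte blocks.
trace = [0, 64, 0, 128, 64, 0]
caches = [DirectMappedCache(num_sets=n) for n in (1, 2, 4)]
results = simulate(trace, caches)
for cache, (hits, misses) in zip(caches, results):
    print(f"sets={cache.num_sets}: hits={hits} misses={misses}")
```

In this serial form, run time grows linearly with the number of simulated caches, which is the baseline the abstract compares against.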
