PriME: A parallel and distributed simulator for thousand-core chips

Modern processors are integrating an increasing number of cores, which brings new design challenges. However, mainstream architectural simulators primarily focus on unicore or multicore systems with small core counts. In order to simulate emerging manycore architectures, more simulators designed for thousand-core systems are needed. In this paper, we introduce the Princeton Manycore Executor (PriME), a parallelized, MPI-based, x86 manycore simulator. The primary goal of PriME is to provide high performance simulation for manycore architectures, allowing fast exploration of architectural ideas including cache hierarchies, coherence protocols and NoCs in thousand-core systems. PriME supports multi-threaded workloads as well as multi-programmed workloads. Furthermore, it parallelizes the simulation of multi-threaded workloads inside of a host machine and parallelizes the simulation of multi-programmed workloads across multiple host machines by utilizing MPI to communicate between different simulator modules. By using two levels of parallelization (within a host machine and across host machines), PriME can improve simulation performance which is especially useful when simulating thousands of cores. Prime is especially adept at executing simulations which have large memory requirements as previous simulators which use multiple machines are unable to simulate more memory than is available in any single host machine. We demonstrate PriME simulating 2000+ core machines and show near-linear scaling on up to 108 host processors split across 9 machines. Finally we validate PriME against a real-world 40-core machine and show the average error to be 12%.

[1]  Jung Ho Ahn,et al.  How to simulate 1000 cores , 2009, CARN.

[2]  Sally A. McKee,et al.  Understanding PARSEC performance on contemporary CMPs , 2009, 2009 IEEE International Symposium on Workload Characterization (IISWC).

[3]  Carl Ramey,et al.  TILE-Gx100 ManyCore processor: Acceleration interfaces and architecture , 2011, 2011 IEEE Hot Chips 23 Symposium (HCS).

[4]  Jianwei Chen,et al.  SlackSim: a platform for parallel simulations of CMPs on CMPs , 2009, CARN.

[5]  Sunil D. Sherlekar Intel Many Integrated Core (MIC) Architecture. , 2012, ICPADS 2012.

[6]  Adam Silberstein,et al.  Benchmarking cloud serving systems with YCSB , 2010, SoCC '10.

[7]  James R. Larus,et al.  Wisconsin Wind Tunnel II: a fast, portable parallel architecture simulator , 2000, IEEE Concurr..

[8]  Scott Devine,et al.  Using the SimOS machine simulator to study complex computer systems , 1997, TOMC.

[9]  Jie Huang,et al.  The HiBench benchmark suite: Characterization of the MapReduce-based data analysis , 2010, 2010 IEEE 26th International Conference on Data Engineering Workshops (ICDEW 2010).

[10]  Shunfei Chen,et al.  MARSS: A full system simulator for multicore x86 CPUs , 2011, 2011 48th ACM/EDAC/IEEE Design Automation Conference (DAC).

[11]  Milo M. K. Martin,et al.  Multifacet's general execution-driven multiprocessor simulator (GEMS) toolset , 2005, CARN.

[12]  Christoforos E. Kozyrakis,et al.  ZSim: fast and accurate microarchitectural simulation of thousand-core systems , 2013, ISCA.

[13]  Christopher J. Hughes,et al.  RSIM: Simulating Shared-Memory Multiprocessors with ILP Processors , 2002, Computer.

[14]  Babak Falsafi,et al.  ProtoFlex: Towards Scalable, Full-System Multiprocessor Simulations Using FPGAs , 2009, TRETS.

[15]  Somayeh Sardashti,et al.  The gem5 simulator , 2011, CARN.

[16]  Lieven Eeckhout,et al.  Sniper: Exploring the level of abstraction for scalable and accurate parallel multi-core simulation , 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).

[17]  Nanning Zheng,et al.  HORNET: A Cycle-Level Multicore Simulator , 2012, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems.

[18]  James R. Larus,et al.  The Wisconsin Wind Tunnel: virtual prototyping of parallel computers , 1993, SIGMETRICS '93.

[19]  Fredrik Larsson,et al.  Simics: A Full System Simulation Platform , 2002, Computer.

[20]  Anoop Gupta,et al.  The SPLASH-2 programs: characterization and methodological considerations , 1995, ISCA.

[21]  Fabrice Bellard,et al.  QEMU, a Fast and Portable Dynamic Translator , 2005, USENIX Annual Technical Conference, FREENIX Track.

[22]  Harish Patil,et al.  Pin: building customized program analysis tools with dynamic instrumentation , 2005, PLDI '05.

[23]  Dam Sunwoo,et al.  FPGA-Accelerated Simulation Technologies (FAST): Fast, Full-System, Cycle-Accurate Simulators , 2007, MICRO.

[24]  B. Jacob,et al.  CMP $ im : A Pin-Based OnThe-Fly Multi-Core Cache Simulator , 2008 .

[25]  Todd M. Austin,et al.  SimpleScalar: An Infrastructure for Computer System Modeling , 2002, Computer.

[26]  Laxmikant V. Kalé,et al.  BigSim: a parallel simulator for performance prediction of extremely large parallel machines , 2004, 18th International Parallel and Distributed Processing Symposium, 2004. Proceedings..

[27]  George Bosilca,et al.  Open MPI: Goals, Concept, and Design of a Next Generation MPI Implementation , 2004, PVM/MPI.

[28]  David I. August,et al.  Exploiting parallelism and structure to accelerate the simulation of chip multi-processors , 2006, The Twelfth International Symposium on High-Performance Computer Architecture, 2006..

[29]  Michael Adler,et al.  HAsim: FPGA-based high-detail multicore simulation using time-division multiplexing , 2011, 2011 IEEE 17th International Symposium on High Performance Computer Architecture.

[30]  George Kurian,et al.  Graphite: A distributed parallel simulator for multicores , 2010, HPCA - 16 2010 The Sixteenth International Symposium on High-Performance Computer Architecture.

[31]  David A. Patterson,et al.  A case for FAME: FPGA architecture model execution , 2010, ISCA.

[32]  Babak Falsafi,et al.  Clearing the clouds: a study of emerging scale-out workloads on modern hardware , 2012, ASPLOS XVII.

[33]  Christian Bienia,et al.  Benchmarking modern multiprocessors , 2011 .

[34]  Eric A. Brewer,et al.  PROTEUS: a high-performance parallel-architecture simulator , 1992, SIGMETRICS '92/PERFORMANCE '92.

[35]  R. E. Kessler The Cavium 32 Core OCTEON II 68xx , 2011, 2011 IEEE Hot Chips 23 Symposium (HCS).

[36]  Henry Hoffmann,et al.  The Raw Microprocessor: A Computational Fabric for Software Circuits and General-Purpose Programs , 2002, IEEE Micro.

[37]  Thomas F. Wenisch,et al.  SimFlex: Statistical Sampling of Computer System Simulation , 2006, IEEE Micro.

[38]  Sanjay J. Patel,et al.  Rigel: an architecture and scalable programming interface for a 1000-core accelerator , 2009, ISCA '09.

[39]  Mendel Rosenblum,et al.  Embra: fast and flexible machine simulation , 1996, SIGMETRICS '96.

[40]  Shobhit Kanaujia,et al.  FastMP: A Multi-core Simulation Methodology , 2006 .