GPU Acceleration for Simulating Massively Parallel Many-Core Platforms

Emerging massively parallel architectures, such as a general-purpose processor coupled with many-core programmable accelerators, are creating increasing demand for novel methods of architectural simulation. Most state-of-the-art simulation technologies are exceedingly slow, and the need to model full-system many-core architectures adds further complexity. This paper presents a fast, scalable, parallel simulator that uses a novel methodology to accelerate the simulation of a many-core coprocessor on GPU platforms: the simulation of many target nodes is mapped to the many hardware threads available on highly parallel GPUs. We demonstrate the challenges, feasibility, and benefits of using a heterogeneous system (CPU and GPU) to simulate future many-core heterogeneous platforms. The target architecture selected to evaluate our methodology consists of an ARM general-purpose CPU coupled with a many-core coprocessor with thousands of simple in-order cores connected in a tile network. This work presents optimization techniques used to parallelize the simulation specifically for acceleration on GPUs. We partition the full-system simulation between CPU and GPU: the target general-purpose CPU is simulated on the host CPU, whereas the many-core coprocessor is simulated on an NVIDIA Tesla 2070 GPU. Our experiments show performance of up to 50 MIPS when simulating the entire heterogeneous chip, and high scalability as the number of coprocessor cores increases.
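As an illustration only, the idea of mapping each simulated target core to one GPU hardware thread might be sketched as below. The toy three-instruction ISA and the names `CoreState`, `step_core`, `simulate`, and `PROGRAM` are hypothetical and introduced here for the example; they are not taken from the simulator. The inner Python loop is a sequential stand-in for the threads of a GPU kernel, each of which would run the same fetch-decode-execute body on its own core state.

```python
from dataclasses import dataclass

# Hypothetical toy ISA: each instruction is an (opcode, operand) pair.
ADD, JNZ_DEC, HALT = "add", "jnz_dec", "halt"

# Toy program: add 2 to the accumulator once per loop iteration, then halt.
PROGRAM = [(ADD, 2), (JNZ_DEC, 0), (HALT, 0)]


@dataclass
class CoreState:
    """Architectural state of one simulated in-order core (assumed layout)."""
    pc: int = 0
    acc: int = 0
    counter: int = 0
    halted: bool = False
    retired: int = 0  # number of instructions this core has executed


def step_core(core, program):
    """Per-core step: fetch, decode, and execute a single instruction.

    On a GPU this body would be executed by one thread per simulated core.
    """
    if core.halted:
        return
    op, arg = program[core.pc]
    core.retired += 1
    if op == ADD:
        core.acc += arg
        core.pc += 1
    elif op == JNZ_DEC:
        # Decrement the loop counter and jump to `arg` while it is non-zero.
        core.counter -= 1
        core.pc = arg if core.counter > 0 else core.pc + 1
    else:  # HALT
        core.halted = True


def simulate(program, loop_counts):
    """Advance all cores in lock-step until every core has halted."""
    cores = [CoreState(counter=n) for n in loop_counts]
    while not all(c.halted for c in cores):
        # Sequential stand-in for launching one GPU thread per core.
        for core in cores:
            step_core(core, program)
    return cores


if __name__ == "__main__":
    for i, c in enumerate(simulate(PROGRAM, [1, 2, 3])):
        print(f"core {i}: acc={c.acc} retired={c.retired}")
```

In this sketch every core runs the same program with a different loop count, so cores halt at different times and idle afterwards. That is a simplified picture of the control-flow divergence a GPU-based simulator would have to manage.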
