GpuTejas: A parallel simulator for GPU architectures

In this paper, we introduce GpuTejas, a new Java-based parallel GPGPU simulator. GpuTejas is a fast trace-driven simulator that derives its speedups from relaxed synchronization and non-blocking data structures. It also introduces a novel scheduling and partitioning scheme for parallelizing a GPU simulator. We evaluate the performance of our simulator with a set of Rodinia benchmarks. We demonstrate a mean speedup of 17.33x with 64 threads over sequential execution, and a speedup of 429x over the widely used simulator GPGPU-Sim. We validated our timing and simulation model by comparing our results with a native system (NVIDIA Tesla M2070). Compared to the sequential version of GpuTejas, the parallel version has an error limited to <7.67% for our suite of benchmarks, which is similar to the numbers reported by competing parallel simulators.
