Nyami: a synthesizable GPU architectural model for general-purpose and graphics-specific workloads

Graphics processing units (GPUs) continue to grow in popularity for general-purpose, highly parallel, high-throughput systems. This has forced GPU vendors to increase their focus on general-purpose workloads, sometimes at the expense of graphics-specific workloads. Using GPUs for general-purpose computation is a departure from the forces that originally drove programmable GPUs, which focused on a narrow subset of graphics rendering operations. Rather than focus purely on graphics-related or general-purpose use, we have designed and modeled an architecture that optimizes for both simultaneously in order to handle all GPU workloads efficiently. In this paper, we present Nyami, a co-optimized GPU architecture and simulation model with an open-source implementation written in Verilog. This approach allows us to explore the GPU design space more easily in a synthesizable, cycle-precise, modular environment. An instruction-precise functional simulator is provided for co-simulation and verification. Overall, we assume a GPU may be used either as a general-purpose GPU (GPGPU) or as a graphics engine, and we account for this both in the architecture's construction and in the options and modules selectable for synthesis and simulation. To demonstrate Nyami's viability as a GPU research platform, we exploit its flexibility and modularity to explore the impact of a set of architectural decisions: sensitivity to cache size and associativity, barrel versus switch-on-stall multithreaded instruction scheduling, and software versus hardware implementations of rasterization. Through these experiments, we gain insight into commonly accepted GPU architecture decisions, adapt the architecture accordingly, and give examples of Nyami's intended use as a GPU research tool.
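To make the two scheduling policies named above concrete, the following is a minimal Python sketch of how each might select the next hardware thread to issue from. It is not taken from the Nyami Verilog implementation; the function names, the `ready` list, and the choice to let the barrel policy skip stalled threads are all assumptions made for illustration.

```python
from typing import List, Optional


def barrel_next(current: int, ready: List[bool]) -> Optional[int]:
    """Barrel policy (illustrative): move to a different thread every cycle.

    Starting from the thread after `current`, return the first thread that
    is ready to issue, wrapping around. Skipping stalled threads is an
    assumption of this sketch; a strict barrel design may instead waste
    the slot. Returns None if no thread is ready.
    """
    n = len(ready)
    for step in range(1, n + 1):
        candidate = (current + step) % n
        if ready[candidate]:
            return candidate
    return None


def switch_on_stall_next(current: int, ready: List[bool]) -> Optional[int]:
    """Switch-on-stall policy (illustrative): stay on one thread until it stalls.

    Keep issuing from `current` while it is ready; only when it stalls,
    fall back to a round-robin search for another ready thread.
    """
    if ready[current]:
        return current
    return barrel_next(current, ready)


if __name__ == "__main__":
    # Hypothetical snapshot: thread 2 is stalled (e.g. on a cache miss).
    ready = [True, True, False, True]
    print("barrel from thread 1:", barrel_next(1, ready))
    print("switch-on-stall from thread 1:", switch_on_stall_next(1, ready))
    print("switch-on-stall from thread 2:", switch_on_stall_next(2, ready))
```

In this sketch the barrel policy interleaves threads on every cycle, while switch-on-stall leaves the issuing thread unchanged until it can no longer make progress; the two functions differ only in that first check.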
