Vortex: Extending the RISC-V ISA for GPGPU and 3D-Graphics

Open-source hardware and software are becoming increasingly important. Yet although GPUs are among the most widely used accelerators across applications, very little open-source GPU infrastructure is publicly available. We argue that one reason for this gap is the complexity of GPU ISAs and software stacks. In this work, we first propose an ISA extension to RISC-V that supports GPGPU and graphics workloads. The main goal of the proposal is to minimize the ISA changes so that the corresponding changes to the open-source ecosystem are also minimal, which keeps development sustainable. To demonstrate the feasibility of the minimally extended RISC-V ISA, we implemented the complete software and hardware stacks of Vortex on FPGA. Vortex is a PCIe-based soft GPU that supports OpenCL and OpenGL. It can be used in a variety of applications, including machine learning, graph analytics, and graphics rendering. Vortex scales up to 32 cores on an Altera Stratix 10 FPGA, delivering a peak performance of 25.6 GFLOPS at 200 MHz.
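
As a quick sanity check on the peak figure above, the short sketch below divides the quoted peak throughput by the core count and clock frequency. It uses only the three numbers stated in the abstract; the function name is hypothetical, and the abstract does not say how the resulting per-core rate is split across threads, lanes, or fused operations, so no such breakdown is assumed here.

```python
def flops_per_core_per_cycle(peak_gflops: float, cores: int, clock_mhz: float) -> float:
    """Return the floating-point operations per core per clock cycle
    implied by a peak throughput figure.

    Hypothetical helper for illustration; not part of the Vortex code base.
    """
    peak_flops = peak_gflops * 1e9   # GFLOPS -> FLOPS
    clock_hz = clock_mhz * 1e6       # MHz -> Hz
    return peak_flops / (cores * clock_hz)


# Figures quoted in the abstract: 25.6 GFLOPS, 32 cores, 200 MHz.
implied = flops_per_core_per_cycle(25.6, 32, 200.0)
print(f"Implied rate: {implied:.1f} FLOPs per core per cycle")
```

With the abstract's numbers, this gives roughly 4 floating-point operations per core per cycle.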
