Paper Information - Vortex: Extending the RISC-V ISA for GPGPU and 3D-Graphics

The importance of open-source hardware and software has been increasing. However, despite GPUs being one of the more popular accelerators across various applications, there is very little open-source GPU infrastructure in the public domain. We argue that one of the reasons for the lack of open-source infrastructure for GPUs is rooted in the complexity of their ISA and software stacks. In this work, we first propose an ISA extension to RISC-V that supports GPGPUs and graphics. The main goal of the ISA extension proposal is to minimize the ISA changes so that the corresponding changes to the open-source ecosystem are also minimal, which makes for a sustainable development ecosystem. To demonstrate the feasibility of the minimally extended RISC-V ISA, we implemented the complete software and hardware stacks of Vortex on FPGA. Vortex is a PCIe-based soft GPU that supports OpenCL and OpenGL. Vortex can be used in a variety of applications, including machine learning, graph analytics, and graphics rendering. Vortex can scale up to 32 cores on an Altera Stratix 10 FPGA, delivering a peak performance of 25.6 GFlops at 200 MHz.
