MEG: A RISCV-based System Emulation Infrastructure for Near-data Processing Using FPGAs and High-bandwidth Memory

Emerging three-dimensional (3D) memory technologies, such as the Hybrid Memory Cube (HMC) and High Bandwidth Memory (HBM), provide high-bandwidth and massive memory-level parallelism. With the growing heterogeneity and complexity of computer systems (CPU cores and accelerators, etc.), efficiently integrating emerging memories into existing systems poses new challenges and requires detailed evaluation in a realistic computing environment. In this article, we propose MEG, an open source, configurable, cycle-exact, and RISC-V-based full-system emulation infrastructure using FPGA and HBM. MEG provides a highly modular hardware design and includes a bootable Linux image for a realistic software flow, so that users can perform cross-layer software-hardware co-optimization in a full-system environment. To improve the observability and debuggability of the system, MEG also provides a flexible performance monitoring scheme to guide the performance optimization. The proposed MEG infrastructure can potentially benefit broad communities across computer architecture, system software, and application software. Leveraging MEG, we present two cross-layer system optimizations as illustrative cases to demonstrate the usability of MEG. In the first case study, we present a reconfigurable memory controller to improve the address mapping of standard memory controller. This reconfigurable memory controller along with its OS support allows us to optimize the address mapping scheme to fully exploit the massive parallelism provided by the emerging three-dimensional (3D) memories. In the second case study, we present a lightweight IOMMU design to tackle the unique challenges brought by 3D memory in providing virtual memory support for near-memory accelerators. We provide a prototype implementation of MEG on a Xilinx VU37P FPGA and demonstrate its capability, fidelity, and flexibility on real-world benchmark applications. We hope MEG fills a gap in the space of publicly available FPGA-based full-system emulation infrastructures, specifically targeting memory systems, and inspires further collaborative software/hardware innovations.

[1]  Makoto Motoyoshi,et al.  Through-Silicon Via (TSV) , 2009, Proceedings of the IEEE.

[2]  Lieven Eeckhout,et al.  Fast, Accurate, and Validated Full-System Software Simulation of x86 Hardware , 2010, IEEE Micro.

[3]  Inderjeet Mani,et al.  Multi-Document Summarization by Graph Search and Matching , 1997, AAAI/IAAI.

[4]  Maya Gokhale,et al.  Near memory key/value lookup acceleration , 2017, MEMSYS.

[5]  Christoforos E. Kozyrakis,et al.  Practical Near-Data Processing for In-Memory Analytics Frameworks , 2015, 2015 International Conference on Parallel Architecture and Compilation (PACT).

[6]  Lesley Shannon,et al.  Evaluating the Performance Efficiency of a Soft-Processor, Variable-Length, Parallel-Execution-Unit Architecture for FPGAs Using the RISC-V ISA , 2018, 2018 IEEE 26th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM).

[7]  Maya Gokhale,et al.  Microscope on Memory: MPSoC-Enabled Computer Memory System Assessments , 2018, 2018 IEEE 26th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM).

[8]  J. Thomas Pawlowski,et al.  Hybrid memory cube (HMC) , 2011, 2011 IEEE Hot Chips 23 Symposium (HCS).

[9]  David Roberts,et al.  Heterogeneous memory architectures: A HW/SW approach for mixing die-stacked and off-package memories , 2015, 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA).

[10]  Luca Benini,et al.  Design and Evaluation of a Processing-in-Memory Architecture for the Smart Memory Cube , 2016, ARCS.

[11]  Alexandr Andoni,et al.  Near-Optimal Hashing Algorithms for Approximate Nearest Neighbor in High Dimensions , 2006, 2006 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS'06).

[12]  Kiyoung Choi,et al.  A scalable processing-in-memory accelerator for parallel graph processing , 2015, 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA).

[13]  Babak Falsafi,et al.  ProtoFlex: Towards Scalable, Full-System Multiprocessor Simulations Using FPGAs , 2009, TRETS.

[14]  P. J. Narayanan,et al.  Accelerating Large Graph Algorithms on the GPU Using CUDA , 2007, HiPC.

[15]  Luca Benini,et al.  Exploring Shared Virtual Memory for FPGA Accelerators with a Configurable IOMMU , 2019, IEEE Transactions on Computers.

[16]  Gustavo Alonso,et al.  Runtime Parameterizable Regular Expression Operators for Databases , 2016, 2016 IEEE 24th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM).

[17]  Somayeh Sardashti,et al.  The gem5 simulator , 2011, CARN.

[18]  M SwiftMichael,et al.  Efficient virtual memory for big memory servers , 2013 .

[19]  Krste Asanovic,et al.  The RISC-V Instruction Set Manual Volume 2: Privileged Architecture Version 1.7 , 2015 .

[20]  Ronald G. Dreslinski,et al.  The M5 Simulator: Modeling Networked Systems , 2006, IEEE Micro.

[21]  Jason Cong,et al.  Supporting Address Translation for Accelerator-Centric Architectures , 2017, 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA).

[22]  Gustavo Alonso,et al.  doppioDB: A hardware accelerated database , 2017, 2017 27th International Conference on Field Programmable Logic and Applications (FPL).

[23]  Yoonho Park,et al.  Data access optimization in a processing-in-memory system , 2015, Conf. Computing Frontiers.

[24]  Milo M. K. Martin,et al.  Multifacet's general execution-driven multiprocessor simulator (GEMS) toolset , 2005, CARN.

[25]  Onur Mutlu,et al.  Ambit: In-Memory Accelerator for Bulk Bitwise Operations Using Commodity DRAM Technology , 2017, 2017 50th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[26]  Guy Lemieux,et al.  Embedded supercomputing in FPGAs with the VectorBlox MXP Matrix Processor , 2013, 2013 International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS).

[27]  Michael M. Swift,et al.  Efficient virtual memory for big memory servers , 2013, ISCA.

[28]  MutluOnur,et al.  A scalable processing-in-memory accelerator for parallel graph processing , 2015 .

[29]  B. Nikolić,et al.  BOOM v 2 an open-source out-of-order RISC-V core , 2017 .

[30]  Piotr Luszczek,et al.  Design and Implementation of the HPC Challenge Benchmark Suite , 2011 .

[31]  Aditya Chopra,et al.  FireSim: FPGA-Accelerated Cycle-Exact Scale-Out System Simulation in the Public Cloud , 2018, 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA).

[32]  Yuan Zhou,et al.  Rosetta: A Realistic High-Level Synthesis Benchmark Suite for Software Programmable FPGAs , 2018, FPGA.

[33]  Ki-Seok Chung,et al.  CasHMC: A Cycle-Accurate Simulator for Hybrid Memory Cube , 2017, IEEE Computer Architecture Letters.

[34]  A. Azzouz 2011 , 2020, City.

[35]  Michael M. Swift,et al.  Devirtualizing Memory in Heterogeneous Systems , 2018, ASPLOS.

[36]  Christoforos E. Kozyrakis,et al.  ZSim: fast and accurate microarchitectural simulation of thousand-core systems , 2013, ISCA.

[37]  Manos Athanassoulis,et al.  Beyond the Wall: Near-Data Processing for Databases , 2015, DaMoN.

[38]  Andrew V. Goldberg,et al.  A new approach to the maximum flow problem , 1986, STOC '86.

[39]  Krste Asanovic,et al.  Chipyard: Integrated Design, Simulation, and Implementation Framework for Custom SoCs , 2020, IEEE Micro.

[40]  Yong Chen,et al.  HMC-Sim-2.0: A Simulation Platform for Exploring Custom Memory Cube Operations , 2016, 2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW).

[41]  Sung Min Kim,et al.  A stacked memory device on logic 3D technology for ultra-high-density data storage , 2011, Nanotechnology.

[42]  Onur Mutlu,et al.  SoftMC: A Flexible and Practical Open-Source Infrastructure for Enabling Experimental DRAM Studies , 2017, 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA).

[43]  Yunsup Lee,et al.  The RISC-V Instruction Set Manual , 2014 .

[44]  Aamer Jaleel,et al.  DRAMsim: a memory system simulator , 2005, CARN.

[45]  Jing Li,et al.  Boosting the Performance of FPGA-based Graph Processor using Hybrid Memory Cube: A Case for Breadth First Search , 2017, FPGA.

[46]  James C. Hoe,et al.  FPGA-Accelerated Simulation of Computer Systems , 2014, FPGA-Accelerated Simulation of Computer Systems.