FERMAT: FPGA-Accelerated Heterogeneous Computing Platform Near NVMe Storage

This paper proposes FERMAT, a versatile FPGA-accelerated near-storage computing platform that aims to significantly reduce data latency and energy consumption for data-intensive applications running on heterogeneous computing systems. Two key ideas underpin FERMAT. First, by creating direct, parallel I/O channels between a processor and NVMe (Non-Volatile Memory Express) storage through reconfigurable digital fabric, and by bypassing the entire OS software stack, FERMAT significantly reduces unnecessary data movement and delivers low-latency, high-bandwidth I/O. Second, by "pre-computing" large amounts of data near NVMe storage with FPGA-based computing engines, FERMAT shifts part of a target application’s computation to the data source. To ease deployment, we provide general system-level support and an effective abstraction for near-storage computing, so that FERMAT can be used on any platform equipped with NVMe storage and achieve higher overall performance. To validate the proposed approach, in hardware we have designed and implemented an open-source, FPGA-based, self-managed NVMe controller that 1) directly connects an FPGA-based accelerator to the storage while bypassing all software stacks, and 2) transforms the FPGA device into an in-line computing engine in which multiple user-programmable streaming accelerators concurrently process file streams, greatly improving the performance of data-intensive applications. In software, 3) we have designed a dedicated software stack that equips FPGA-accelerated storage with a user-space filesystem supporting all common file operations on modern Linux systems, together with a flexible, thread-safe programming interface that simplifies user control of the compute engine’s functionality.
We measured the performance of FERMAT against the baseline with five benchmarks from three categories: security, graph query, and graph analysis. FERMAT demonstrated significant speedups ranging from 1.8x to 782.5x in processing throughput.
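The "pre-computing" idea, shifting part of the computation to the data source, can be illustrated with a small software-only analogy. The sketch below is not FERMAT's interface: the names `read_blocks`, `near_storage_engine`, and `select_records`, the 4096-byte block size, and the 16-byte record layout are all hypothetical and chosen only for illustration. It compares how many bytes reach the host when a selection kernel runs on the host versus in-line with the storage stream.

```python
from typing import Callable, Iterable, Iterator

BLOCK_SIZE = 4096   # hypothetical storage block size in bytes
RECORD_SIZE = 16    # hypothetical fixed record size; the first byte is the key


def make_dataset(n_records: int) -> bytes:
    """Build n_records fixed-size records whose key byte cycles through 0..255."""
    return b"".join(
        bytes([i % 256]) + bytes(RECORD_SIZE - 1) for i in range(n_records)
    )


def read_blocks(data: bytes) -> Iterator[bytes]:
    """Stand-in for the storage device: yield the data block by block."""
    for offset in range(0, len(data), BLOCK_SIZE):
        yield data[offset:offset + BLOCK_SIZE]


def select_records(key: int) -> Callable[[bytes], bytes]:
    """Return a streaming kernel that keeps only records with the given key."""
    def kernel(block: bytes) -> bytes:
        return b"".join(
            block[i:i + RECORD_SIZE]
            for i in range(0, len(block), RECORD_SIZE)
            if block[i] == key
        )
    return kernel


def near_storage_engine(
    blocks: Iterable[bytes], kernel: Callable[[bytes], bytes]
) -> Iterator[bytes]:
    """Stand-in for an in-line compute engine: run the kernel on each block
    before it is forwarded to the host, dropping empty results."""
    for block in blocks:
        reduced = kernel(block)
        if reduced:
            yield reduced


def bytes_to_host(stream: Iterable[bytes]) -> int:
    """Count the bytes that cross the (simulated) storage-to-host link."""
    return sum(len(chunk) for chunk in stream)


data = make_dataset(1024)

# Baseline: every block travels to the host, which filters afterwards.
baseline = bytes_to_host(read_blocks(data))

# Near-storage: the kernel runs at the data source; only matches travel.
offloaded = bytes_to_host(near_storage_engine(read_blocks(data), select_records(7)))

print(f"baseline: {baseline} bytes, offloaded: {offloaded} bytes")
```

In this toy setting the reduction in transferred bytes depends entirely on the kernel's selectivity; it says nothing about the latency, bandwidth, or energy behavior of the actual hardware, which the benchmarks above measure.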