DPUV3INT8: A Compiler View to programmable FPGA Inference Engines

We have an FPGA design that we have made fast and efficient, and that we have tested on a few important examples. Now we must infer a general solution to deploy in the data center. Here, we describe the FPGA DPUV3INT8 design and our compiler effort. The hand-tuned SW-HW solution for Resnet50 v1 achieves close to 2 times the throughput (images per second) of our best FPGA implementation. The compiler generalizes the hand-written techniques, achieving about 1.5 times better performance on the same example. It also extends the optimizations to a model zoo of networks, and it achieves 80+% HW efficiency.
