DPUV3INT8: A Compiler View to programmable FPGA Inference Engines

We have an FPGA design that we have made fast and efficient and have tested on a few important examples. Now we must generalize it into a solution to deploy in the data center. Here, we describe the DPUV3INT8 FPGA design and our compiler effort. The hand-tuned SW-HW solution for ResNet50 v1 achieves close to 2 times the throughput (images per second) of our best FPGA implementation. The compiler generalizes the hand-written techniques, achieving about 1.5 times better performance for the same example. It also extends these optimizations to a model zoo of networks, where it achieves 80+% HW efficiency.
