3U-EdgeAI: Ultra-Low Memory Training, Ultra-Low Bitwidth Quantization, and Ultra-Low Latency Acceleration
Cong Hao | Cole Hawkins | Zheng Zhang | Kaiqi Zhang | Yao Chen
[1] Jinjun Xiong, et al. FPGA/DNN Co-Design: An Efficient Design Methodology for IoT Intelligence on the Edge, 2019, 2019 56th ACM/IEEE Design Automation Conference (DAC).
[2] Hiroki Nakahara, et al. A fully connected layer elimination for a binarized convolutional neural network on an FPGA, 2017, 2017 27th International Conference on Field Programmable Logic and Applications (FPL).
[3] Wonyong Sung, et al. FPGA based implementation of deep neural networks using on-chip memory only, 2016, 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[4] Dilin Wang, et al. Stein Variational Gradient Descent: A General Purpose Bayesian Inference Algorithm, 2016, NIPS.
[5] Shenghuo Zhu, et al. Extremely Low Bit Neural Network: Squeeze the Last Bit Out with ADMM, 2017, AAAI.
[6] Hongjun Wang, et al. Real-Time Object Tracking System on FPGAs, 2011, 2011 Symposium on Application Accelerators in High-Performance Computing.
[7] Alexander Novikov, et al. Tensorizing Neural Networks, 2015, NIPS.
[8] Ivan Oseledets, et al. Tensor-Train Decomposition, 2011, SIAM J. Sci. Comput.
[9] Zheng Zhang, et al. Bayesian Tensorized Neural Networks with Automatic Rank Selection, 2019, Neurocomputing.
[10] Bin Liu, et al. Ternary Weight Networks, 2016, ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[11] Nicholas Caldwell, et al. Scalable high-performance architecture for convolutional ternary neural networks on FPGA, 2017, 2017 27th International Conference on Field Programmable Logic and Applications (FPL).
[12] Satoshi Nakamura, et al. Compressing recurrent neural network with tensor train, 2017, 2017 International Joint Conference on Neural Networks (IJCNN).
[13] Yoshua Bengio, et al. Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation, 2013, ArXiv.
[14] Philip Heng Wai Leong, et al. FINN: A Framework for Fast, Scalable Binarized Neural Network Inference, 2016, FPGA.
[15] Luca Benini, et al. Enabling Design Methodologies and Future Trends for Edge AI: Specialization and Codesign, 2021, IEEE Design & Test.
[16] Kai Zhang, et al. T-DLA: An Open-source Deep Learning Accelerator for Ternarized DNN Models on Embedded FPGA, 2019, 2019 IEEE Computer Society Annual Symposium on VLSI (ISVLSI).
[17] Tamara G. Kolda, et al. Tensor Decompositions and Applications, 2009, SIAM Rev.
[18] Danilo P. Mandic, et al. Tucker Tensor Layer in Fully Connected Neural Networks, 2019, ArXiv.
[19] H. T. Kung, et al. Distributed Deep Neural Networks Over the Cloud, the Edge and End Devices, 2017, 2017 IEEE 37th International Conference on Distributed Computing Systems (ICDCS).
[20] Soheil Ghiasi, et al. Hardware-oriented Approximation of Convolutional Neural Networks, 2016, ArXiv.
[21] Valentin Khrulkov, et al. Tensorized Embedding Layers for Efficient Model Compression, 2019, ArXiv.
[22] Christopher J. Hillar, et al. Most Tensor Problems Are NP-Hard, 2009, JACM.
[23] Yang Liu, et al. Two-Step Quantization for Low-bit Neural Networks, 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[24] Jinjun Xiong, et al. DNNBuilder: an Automated Tool for Building High-Performance DNN Hardware Accelerators for FPGAs, 2018, 2018 IEEE/ACM International Conference on Computer-Aided Design (ICCAD).
[25] Yoshua Bengio, et al. BinaryConnect: Training Deep Neural Networks with binary weights during propagations, 2015, NIPS.
[26] Bo Yuan, et al. Compressing Recurrent Neural Networks Using Hierarchical Tucker Tensor Decomposition, 2020, ArXiv.
[27] Deming Chen, et al. µL2Q: An Ultra-Low Loss Quantization Method for DNN Compression, 2019, 2019 International Joint Conference on Neural Networks (IJCNN).
[28] Cole Hawkins, et al. On-FPGA Training with Ultra Memory Reduction: A Low-Precision Tensor Method, 2021, ArXiv.
[29] Song Han, et al. Deep Compression: Compressing Deep Neural Network with Pruning, Trained Quantization and Huffman Coding, 2015, ICLR.
[30] Yao Chen, et al. Cloud-DNN: An Open Framework for Mapping DNN Models to Cloud FPGAs, 2019, FPGA.
[31] Yue Wang, et al. E2-Train: Training State-of-the-art CNNs with Over 80% Energy Savings, 2019, NeurIPS.
[32] Alexander Novikov, et al. Ultimate tensorization: compressing convolutional and FC layers alike, 2016, ArXiv.
[33] Tao Li, et al. VecQ: Minimal Loss DNN Model Compression With Vectorized Weight Quantization, 2020, IEEE Transactions on Computers.
[34] Ivan V. Oseledets, et al. Speeding-up Convolutional Neural Networks Using Fine-tuned CP-Decomposition, 2014, ICLR.
[35] Di He, et al. Machine learning on FPGAs to face the IoT revolution, 2017, 2017 IEEE/ACM International Conference on Computer-Aided Design (ICCAD).
[36] Ce Zhu, et al. Tensor rank learning in CP decomposition via convolutional neural network, 2019, Signal Process. Image Commun.
[37] Frédéric Pétrot, et al. Ternary neural networks for resource-efficient AI applications, 2016, 2017 International Joint Conference on Neural Networks (IJCNN).
[38] Rajesh Gupta, et al. Accelerating Binarized Convolutional Neural Networks with Software-Programmable FPGAs, 2017, FPGA.
[39] Yoshua Bengio, et al. BinaryNet: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1, 2016, ArXiv.
[40] Yinghai Lu, et al. Deep Learning Recommendation Model for Personalization and Recommendation Systems, 2019, ArXiv.
[41] Chong Wang, et al. Stochastic variational inference, 2012, J. Mach. Learn. Res.
[42] Zheng Zhang, et al. Towards Compact Neural Networks via End-to-End Training: A Bayesian Tensor Approach with Automatic Rank Determination, 2020, SIAM J. Math. Data Sci.
[43] Yu Wang, et al. Going Deeper with Embedded FPGA Platform for Convolutional Neural Network, 2016, FPGA.