3U-EdgeAI: Ultra-Low Memory Training, Ultra-Low Bitwidth Quantization, and Ultra-Low Latency Acceleration

Deep neural network (DNN) based AI applications on the edge require both low-cost computing platforms and high-quality services. However, the limited memory, computing resources, and power budget of edge devices constrain the effectiveness of DNN algorithms, making edge-oriented AI algorithms and implementations (e.g., accelerators) challenging to develop. In this paper, we summarize our recent efforts toward efficient on-device AI development, covering both training and inference, from three aspects. First, we present on-device training with ultra-low memory usage. We propose a novel rank-adaptive tensorized neural network model, which offers orders-of-magnitude memory reduction during training. Second, we introduce an ultra-low bitwidth quantization method for DNN model compression, achieving state-of-the-art accuracy at the same compression ratio. Third, we present an ultra-low latency DNN accelerator design, following a software/hardware co-design methodology. This paper emphasizes the importance and efficacy of efficient training, quantization, and accelerator design, and calls for more research breakthroughs in AI on the edge.
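To make the first two ideas concrete, the sketch below illustrates, in PyTorch, (a) a tensor-train (TT) factorized linear layer, whose parameter memory grows with the TT ranks rather than the full weight size, and (b) a ternary weight quantizer in the spirit of ternary weight networks. This is a minimal illustration under stated assumptions, not the paper's implementation: the class name `TTLinear`, the mode shapes, the fixed rank, and the `0.7 * mean(|w|)` threshold heuristic are all assumptions for exposition.

```python
import torch
import torch.nn as nn


class TTLinear(nn.Module):
    """Linear layer storing its (M x N) weight as tensor-train (TT) cores.

    With M = prod(m_modes) and N = prod(n_modes), core k has shape
    (r_{k-1}, m_k, n_k, r_k), so parameter memory scales with
    sum_k r^2 * m_k * n_k instead of M * N.
    """

    def __init__(self, m_modes, n_modes, rank):
        super().__init__()
        d = len(m_modes)
        ranks = [1] + [rank] * (d - 1) + [1]
        self.cores = nn.ParameterList([
            nn.Parameter(0.1 * torch.randn(ranks[k], m_modes[k],
                                           n_modes[k], ranks[k + 1]))
            for k in range(d)
        ])

    def full_weight(self):
        # Contract the TT cores back into the dense (M x N) matrix.
        w = self.cores[0].squeeze(0)                  # (m1, n1, r1)
        for core in list(self.cores)[1:]:
            w = torch.einsum('MNr,rmns->MmNns', w, core)
            w = w.reshape(w.shape[0] * w.shape[1],
                          w.shape[2] * w.shape[3], -1)
        return w.squeeze(-1)                          # (M, N)

    def forward(self, x):
        # Dense reconstruction keeps the sketch short; an efficient TT
        # layer would contract x with the cores directly.
        return x @ self.full_weight()


def ternarize(w):
    """TWN-style quantizer: map weights to {-alpha, 0, +alpha}."""
    delta = 0.7 * w.abs().mean()                      # threshold heuristic
    mask = (w.abs() > delta).float()
    alpha = (w.abs() * mask).sum() / mask.sum().clamp(min=1.0)
    return alpha * torch.sign(w) * mask


if __name__ == "__main__":
    # A 4096 x 4096 dense layer vs. its TT factorization (illustrative shapes).
    layer = TTLinear(m_modes=(16, 16, 16), n_modes=(16, 16, 16), rank=8)
    tt_params = sum(p.numel() for p in layer.parameters())
    print(f"TT params: {tt_params} vs dense: {4096 * 4096}")  # ~800x fewer
    y = layer(torch.randn(2, 4096))                   # shape (2, 4096)

    # Straight-through estimator: forward with ternarized weights,
    # backward through the full-precision copy.
    w = layer.full_weight()
    w_q = w + (ternarize(w) - w).detach()
```

At these illustrative shapes the TT layer stores roughly 20K parameters in place of about 16.8M, which is the kind of orders-of-magnitude reduction the abstract refers to; the rank-adaptive method in the paper additionally learns the TT ranks rather than fixing them, which this sketch does not attempt.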
