Memory Access Optimization for On-Chip Transfer Learning

Training a deep neural network (DNN) at the edge faces the challenge of high energy consumption because gradient computation requires a large number of memory accesses. Minimizing data fetches is therefore essential for training a DNN model on the edge. In this paper, a novel technique is proposed to reduce memory accesses during the training of fully connected layers in transfer learning. By analyzing the memory access patterns of the backpropagation phase in fully connected layers, the memory accesses can be optimized. We introduce a new weight-update method that computes a delta term for every node of the output and fully connected layers; the delta term reduces memory accesses for the parameters that would otherwise be fetched repeatedly while training the fully connected layers. The proposed technique shows energy savings of 0.13x-13.93x for the training of fully connected layers of well-known DNN architectures on multiple processor architectures. The technique can be used to perform on-chip transfer learning with reduced energy consumption and memory access.
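The abstract does not spell out the exact dataflow, but the general delta-term idea it refers to can be illustrated with standard backpropagation for a fully connected layer: compute the per-node delta (the gradient of the loss with respect to that node's output) once, keep it in a local buffer, and reuse it for the weight gradient, the bias gradient, and the gradient passed to the previous layer, instead of re-deriving it from repeatedly fetched parameters. The NumPy sketch below is a minimal illustration of that reuse pattern; the function and variable names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def fc_backward_with_delta_reuse(x, w, b, grad_out, lr=0.01):
    """Backward pass for one fully connected layer (y = x @ w + b).

    The per-node delta (dL/dy) is stored once in a local buffer and
    reused for the weight gradient, the bias gradient, and the gradient
    propagated to the previous layer, so the same values are not
    re-fetched for each update.
    """
    delta = grad_out                  # dL/dy, one value per output node
    grad_w = np.outer(x, delta)      # dL/dW reuses the cached delta
    grad_b = delta                    # dL/db is the delta itself
    grad_x = delta @ w.T              # gradient passed to the previous layer

    # In-place SGD update of the transferred layer's parameters.
    w -= lr * grad_w
    b -= lr * grad_b
    return grad_x

# Tiny usage example with made-up shapes.
rng = np.random.default_rng(0)
x = rng.standard_normal(128)          # activations entering the FC layer
w = rng.standard_normal((128, 10))    # weights of the FC layer
b = np.zeros(10)
grad_out = rng.standard_normal(10)    # gradient arriving from the loss
grad_x = fc_backward_with_delta_reuse(x, w, b, grad_out)
```

In a hardware realization, keeping the delta buffer in on-chip storage rather than external memory is what would turn this reuse into the kind of access and energy savings the abstract describes.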
