Knowledge Distillation with Attention for Deep Transfer Learning of Convolutional Networks

Transfer learning through fine-tuning a deep neural network pre-trained on an extremely large dataset, such as ImageNet, can significantly improve and accelerate training, but accuracy is frequently bottlenecked by the limited size of the target dataset. To address this problem, regularization methods that constrain the outer-layer weights of the target network using the starting point as a reference (SPAR) have been studied. In this article, we propose DELTA, a novel regularized transfer learning framework, namely DEep Learning Transfer using feature map with Attention. Instead of constraining the weights of the neural network, DELTA aims to preserve the outer-layer outputs of the source network. Specifically, in addition to minimizing the empirical loss, DELTA aligns the outer-layer outputs of the two networks by constraining a subset of feature maps that are precisely selected by an attention mechanism learned in a supervised manner. We evaluate DELTA against state-of-the-art algorithms, including L^2 and L^2-SP. The experimental results show that our method outperforms these baselines with higher accuracy on new tasks. Code has been made publicly available.1
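The core idea described above can be sketched as an attention-weighted penalty on the distance between corresponding feature maps of the source and target networks, added to the empirical loss. The following is a minimal numpy sketch under that assumption; the function name `delta_regularizer` and the fixed attention weights are illustrative, not the authors' actual implementation (which learns the attention weights in a supervised manner).

```python
import numpy as np

def delta_regularizer(target_maps, source_maps, attention):
    """Hypothetical sketch of a DELTA-style behavioral regularizer:
    a weighted sum of squared distances between corresponding outer-layer
    feature maps of the target and source networks, where per-map
    attention weights emphasize the maps most useful for the new task."""
    reg = 0.0
    for w, ft, fs in zip(attention, target_maps, source_maps):
        reg += w * np.sum((ft - fs) ** 2)
    return reg

# Toy example: three 4x4 feature maps from each network.
rng = np.random.default_rng(0)
source = [rng.standard_normal((4, 4)) for _ in range(3)]
# Target maps drift slightly from the source during fine-tuning.
target = [f + 0.1 * rng.standard_normal((4, 4)) for f in source]
# Illustrative fixed attention weights (in the paper these are learned).
attention = np.array([0.5, 0.3, 0.2])

penalty = delta_regularizer(target, source, attention)
```

In training, `penalty` would be scaled by a regularization coefficient and added to the task loss, so that feature maps with high attention are kept close to the source network while low-attention maps are free to adapt.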
