Spatiotemporal Knowledge Distillation for Efficient Estimation of Aerial Video Saliency

The performance of video saliency estimation techniques has achieved significant advances along with the rapid development of Convolutional Neural Networks (CNNs). However, devices like cameras and drones may have limited computational capability and storage space so that the direct deployment of complex deep saliency models becomes infeasible. To address this problem, this paper proposes a dynamic saliency estimation approach for aerial videos via spatiotemporal knowledge distillation. In this approach, five components are involved, including two teachers, two students and the desired spatiotemporal model. The knowledge of spatial and temporal saliency is first separately transferred from the two complex and redundant teachers to their simple and compact students, while the input scenes are also degraded from high-resolution to low-resolution to remove the probable data redundancy so as to greatly speed up the feature extraction process. After that, the desired spatiotemporal model is further trained by distilling and encoding the spatial and temporal saliency knowledge of two students into a unified network. In this manner, the inter-model redundancy can be removed for the effective estimation of dynamic saliency on aerial videos. Experimental results show that the proposed approach is comparable to 11 state-of-the-art models in estimating visual saliency on aerial videos, while its speed reaches up to 28,738 FPS and 1,490.5 FPS on the GPU and CPU platforms, respectively.

[1]  Gang Wang,et al.  Recurrent Attentional Networks for Saliency Detection , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[2]  Ruigang Yang,et al.  Saliency-Aware Video Object Segmentation , 2018, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[3]  Lihi Zelnik-Manor,et al.  Learning Video Saliency from Human Gaze Using Candidate Selection , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[4]  Aykut Erdem,et al.  Two-Stream Convolutional Networks for Dynamic Saliency Prediction , 2016, ArXiv.

[5]  Peng Jiang,et al.  Salient Region Detection by UFO: Uniqueness, Focusness and Objectness , 2013, 2013 IEEE International Conference on Computer Vision.

[6]  Y Wu,et al.  Novel Approach to Position and Orientation Estimation in Vision-Based UAV Navigation , 2010, IEEE Transactions on Aerospace and Electronic Systems.

[7]  Song-Chun Zhu,et al.  Joint inference of groups, events and human roles in aerial videos , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[8]  Haibin Ling,et al.  Saliency Detection on Light Field , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[9]  Jian Sun,et al.  Accelerating Very Deep Convolutional Networks for Classification and Detection , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[10]  Liqing Zhang,et al.  Dynamic visual attention: searching for coding length increments , 2008, NIPS.

[11]  Matthias Bethge,et al.  DeepGaze II: Predicting fixations from deep features over time and tasks , 2017 .

[12]  Jorma Laaksonen,et al.  Exploiting inter-image similarity and ensemble of extreme learners for fixation prediction using deep features , 2016, Neurocomputing.

[13]  Huchuan Lu,et al.  Deep Mutual Learning , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[14]  Martin D. Levine,et al.  Visual Saliency Based on Scale-Space Analysis in the Frequency Domain , 2013, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[15]  Sanyuan Zhao,et al.  Pyramid Dilated Deeper ConvLSTM for Video Salient Object Detection , 2018, ECCV.

[16]  Haibin Ling,et al.  Salient Object Detection in the Deep Learning Era: An In-Depth Survey , 2019, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[17]  Ali Borji,et al.  CAT2000: A Large Scale Fixation Dataset for Boosting Saliency Research , 2015, ArXiv.

[18]  Kui Fu,et al.  Model-Guided Multi-Path Knowledge Aggregation for Aerial Saliency Prediction , 2020, IEEE Transactions on Image Processing.

[19]  Wenguan Wang,et al.  Deep Visual Attention Prediction , 2017, IEEE Transactions on Image Processing.

[20]  Junwei Han,et al.  A Deep Spatial Contextual Long-Term Recurrent Convolutional Network for Saliency Detection , 2016, IEEE Transactions on Image Processing.

[21]  Matthias Bethge,et al.  Deep Gaze I: Boosting Saliency Prediction with Feature Maps Trained on ImageNet , 2014, ICLR.

[22]  Ali Borji,et al.  Revisiting Video Saliency: A Large-Scale Benchmark and a New Model , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[23]  Shu Fang,et al.  Learning Discriminative Subspaces on Random Contrasts for Image Saliency Analysis , 2017, IEEE Transactions on Neural Networks and Learning Systems.

[24]  Liming Zhang,et al.  A Novel Multiresolution Spatiotemporal Saliency Detection Model and Its Applications in Image and Video Compression , 2010, IEEE Transactions on Image Processing.

[25]  Jian Sun,et al.  Efficient and accurate approximations of nonlinear convolutional networks , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[26]  Razvan Pascanu,et al.  Policy Distillation , 2015, ICLR.

[27]  Homer H. Chen,et al.  Learning-Based Prediction of Visual Attention for Video Signals , 2011, IEEE Transactions on Image Processing.

[28]  Aykut Erdem,et al.  Spatio-Temporal Saliency Networks for Dynamic Saliency Prediction , 2016, IEEE Transactions on Multimedia.

[29]  K. Boudjit,et al.  Detection and implementation autonomous target tracking with a Quadrotor AR.Drone , 2015, 2015 12th International Conference on Informatics in Control, Automation and Robotics (ICINCO).

[30]  Ali Borji,et al.  Probabilistic learning of task-specific visual attention , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[31]  Hongdong Li,et al.  Adversarial Spatio-Temporal Learning for Video Deblurring , 2018, IEEE Transactions on Image Processing.

[32]  Tiejun Huang,et al.  Visual Saliency with Statistical Priors , 2013, International Journal of Computer Vision.

[33]  Ling Shao,et al.  Consistent Video Saliency Using Local Gradient Flow Optimization and Global Refinement , 2015, IEEE Transactions on Image Processing.

[34]  Qi Zhao,et al.  SALICON: Reducing the Semantic Gap in Saliency Prediction by Adapting Deep Neural Networks , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[35]  Wen Gao,et al.  Probabilistic Multi-Task Learning for Visual Saliency Estimation in Video , 2010, International Journal of Computer Vision.

[36]  Apostol Natsev,et al.  YouTube-8M: A Large-Scale Video Classification Benchmark , 2016, ArXiv.

[37]  Hung M. La,et al.  Dynamic Target Tracking and Obstacle Avoidance using a Drone , 2015, ISVC.

[38]  Sebastian Ruder,et al.  An Overview of Multi-Task Learning in Deep Neural Networks , 2017, ArXiv.

[39]  Cemal Aker,et al.  Using deep networks for drone detection , 2017, 2017 14th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS).

[40]  Ke Zhang,et al.  Saliency detection via local structure propagation , 2018, J. Vis. Commun. Image Represent..

[41]  Yong Du,et al.  Facial Expression Recognition Based on Deep Evolutional Spatial-Temporal Networks , 2017, IEEE Transactions on Image Processing.

[42]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[43]  Gang Tao,et al.  Video attention prediction using gaze saliency , 2017, Multimedia Tools and Applications.

[44]  Frédo Durand,et al.  What Do Different Evaluation Metrics Tell Us About Saliency Models? , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[45]  Tie Liu,et al.  DeepVS: A Deep Learning Based Video Saliency Prediction Approach , 2018, ECCV.

[46]  Christof Koch,et al.  A Model of Saliency-Based Visual Attention for Rapid Scene Analysis , 2009 .

[47]  Matthias Bethge,et al.  DeepGaze II: Reading fixations from deep features trained on object recognition , 2016, ArXiv.

[48]  Rich Caruana,et al.  Model compression , 2006, KDD '06.

[49]  Yoshua Bengio,et al.  FitNets: Hints for Thin Deep Nets , 2014, ICLR.

[50]  Lihi Zelnik-Manor,et al.  Context-aware saliency detection , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[51]  Noel E. O'Connor,et al.  Shallow and Deep Convolutional Networks for Saliency Prediction , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[52]  Michael Dorr,et al.  Large-Scale Optimization of Hierarchical Features for Saliency Prediction in Natural Images , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[53]  Nicolas Riche,et al.  Saliency and Human Fixations: State-of-the-Art and Study of Comparison Metrics , 2013, 2013 IEEE International Conference on Computer Vision.

[54]  Nuno Vasconcelos,et al.  How many bits does it take for a stimulus to be salient? , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[55]  Yueting Zhuang,et al.  DeepSaliency: Multi-Task Deep Neural Network Model for Salient Object Detection , 2015, IEEE Transactions on Image Processing.

[56]  Ling-Yu Duan,et al.  Finding the Secret of Image Saliency in the Frequency Domain , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[57]  Gang Wang,et al.  Deep Level Sets for Salient Object Detection , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[58]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[59]  Eunhyeok Park,et al.  Compression of Deep Convolutional Neural Networks for Fast and Low Power Mobile Applications , 2015, ICLR.

[60]  Xuelong Li,et al.  Visual saliency detection using information divergence , 2013, Pattern Recognit..

[61]  Thomas Martinetz,et al.  Intrinsic Dimensionality Predicts the Saliency of Natural Dynamic Scenes , 2012, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[62]  Yuan Yu,et al.  TensorFlow: A system for large-scale machine learning , 2016, OSDI.

[63]  Peyman Milanfar,et al.  Static and space-time visual saliency detection by self-resemblance. , 2009, Journal of vision.

[64]  Laurent Itti,et al.  Automatic foveation for video compression using a neurobiological model of visual attention , 2004, IEEE Transactions on Image Processing.

[65]  Victor Leboran,et al.  Dynamic Whitening Saliency , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[66]  Junmo Kim,et al.  A Gift from Knowledge Distillation: Fast Optimization, Network Minimization and Transfer Learning , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[67]  Weisi Lin,et al.  A Video Saliency Detection Model in Compressed Domain , 2014, IEEE Transactions on Circuits and Systems for Video Technology.

[68]  Misha Denil,et al.  Predicting Parameters in Deep Learning , 2014 .

[69]  Ali Borji,et al.  Boosting bottom-up and top-down visual features for saliency estimation , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[70]  Jing Li,et al.  Visual Attention Modeling for Stereoscopic Video: A Benchmark and Computational Model , 2017, IEEE Transactions on Image Processing.

[71]  Chun Chen,et al.  Low-level and high-level prior learning for visual saliency estimation , 2014, Inf. Sci..

[72]  Ling Shao,et al.  Video Salient Object Detection via Fully Convolutional Networks , 2017, IEEE Transactions on Image Processing.

[73]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[74]  Wataru Shimoda,et al.  Saliency detection by forward and backward cues in deep-CNN , 2017, 2017 IEEE International Conference on Image Processing (ICIP).

[75]  Yafei Song,et al.  A Data-Driven Metric for Comprehensive Evaluation of Saliency Models , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[76]  Nathalie Guyader,et al.  Modelling Spatio-Temporal Saliency to Predict Gaze Direction for Short Videos , 2009, International Journal of Computer Vision.

[77]  Kui Fu,et al.  How Drones Look: Crowdsourced Knowledge Transfer for Aerial Video Saliency Prediction , 2018, ArXiv.

[78]  Pietro Perona,et al.  Graph-Based Visual Saliency , 2006, NIPS.

[79]  Geoffrey E. Hinton,et al.  Distilling the Knowledge in a Neural Network , 2015, ArXiv.

[80]  Rich Caruana,et al.  Do Deep Nets Really Need to be Deep? , 2013, NIPS.

[81]  Li Fei-Fei,et al.  ImageNet: A large-scale hierarchical image database , 2009, CVPR.

[82]  Zhenzhong Chen,et al.  Improving Saliency Detection Based on Modeling Photographer's Intention , 2019, IEEE Transactions on Multimedia.