Multimodal Video Saliency Analysis With User-Biased Information

Video saliency is widely used in video understanding and processing applications. Although studies have indicated that user preferences influence visual attention when watching videos, current saliency research is based on visual content and does not take viewer-related information into account. In this paper, we propose a learning-based multimodal framework that predicts video saliency with the aid of social data analysis. We introduce a popularity-assisted attention mechanism into a content-specific neural network to extract spatio-motion features, and use a convolutional long short-term memory (ConvLSTM) network to capture temporal characteristics. Experiments demonstrate that our approach outperforms state-of-the-art video saliency analysis methods, which validates the effectiveness of incorporating external user-biased information into saliency prediction.
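As a rough illustration of how an external popularity signal could modulate content features, the following minimal sketch gates each feature channel by a combination of its pooled activation and a per-channel popularity score. This is an assumption-laden toy, not the mechanism used in the paper: the function name `popularity_gated_attention`, the `weight` parameter, the additive combination, and the idea that popularity scores are aligned one-to-one with channels are all hypothetical.

```python
import math


def sigmoid(x):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def popularity_gated_attention(features, popularity, weight=1.0):
    """Rescale each channel by a gate built from content and popularity.

    features:   list of C channels, each an H x W list of lists of floats.
    popularity: list of C scores (hypothetical user-derived prior).
    weight:     hypothetical scalar controlling the popularity influence.
    """
    if len(features) != len(popularity):
        raise ValueError("need one popularity score per channel")

    gated = []
    for channel, pop in zip(features, popularity):
        # Squeeze: global average pooling over the spatial dimensions.
        count = sum(len(row) for row in channel)
        pooled = sum(sum(row) for row in channel) / count
        # Gate: content statistic plus weighted popularity, squashed to (0, 1).
        gate = sigmoid(pooled + weight * pop)
        # Excite: rescale every spatial location of the channel.
        gated.append([[value * gate for value in row] for row in channel])
    return gated


if __name__ == "__main__":
    feats = [[[1.0, 1.0], [1.0, 1.0]], [[1.0, 1.0], [1.0, 1.0]]]
    out = popularity_gated_attention(feats, popularity=[0.9, 0.1])
    # With identical content, the channel with higher popularity keeps more signal.
    print(out[0][0][0] > out[1][0][0])
```

In a full model, the gated features would then be passed frame by frame to a recurrent module such as a ConvLSTM; that stage is omitted here.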
