Paper information - Multimodal Deep Features Fusion for Video Memorability Prediction

This paper describes a multimodal feature fusion approach for predicting short-term and long-term video memorability, where the goal is to design a system that automatically predicts scores reflecting the probability of a video being remembered. The approach performs early fusion of text, image, and video features. Text features are extracted using a Convolutional Neural Network (CNN), image features using an FBResNet152 pre-trained on ImageNet, and video features using a 3DResNet152 pre-trained on Kinetics 400. We use Fisher Vectors to obtain a single fixed-length vector for each video, which removes the need for a variable-length global representation to handle temporal information. The fusion approach demonstrates good predictive performance and regresses memorability scores with higher correlation than standard features.
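As a rough illustration of the two ideas named in the abstract, the sketch below encodes a variable number of per-frame descriptors into one fixed-length Fisher Vector and then performs early fusion by concatenating the modality vectors. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names `fisher_vector` and `early_fusion`, the diagonal-covariance GMM, the power and L2 normalization, the per-modality L2 normalization before concatenation, and all dimensions are illustrative choices that the abstract does not specify. The GMM parameters are random here; in practice they would be fitted on training descriptors.

```python
import numpy as np

def fisher_vector(frames, weights, means, variances):
    """Encode (T, D) descriptors as one vector of length 2*K*D.

    weights: (K,), means: (K, D), variances: (K, D) of a diagonal GMM
    (assumed to be fitted beforehand; hypothetical inputs here).
    """
    T = frames.shape[0]
    diff = frames[:, None, :] - means[None, :, :]            # (T, K, D)
    # Log density of each diagonal Gaussian component plus log prior.
    log_prob = -0.5 * (np.sum(diff**2 / variances, axis=2)
                       + np.sum(np.log(2 * np.pi * variances), axis=1))
    log_prob = log_prob + np.log(weights)
    log_prob -= log_prob.max(axis=1, keepdims=True)          # numerical stability
    gamma = np.exp(log_prob)
    gamma /= gamma.sum(axis=1, keepdims=True)                # posteriors, (T, K)
    sigma = np.sqrt(variances)
    # Gradients with respect to the component means and variances.
    g_mu = np.einsum('tk,tkd->kd', gamma, diff / sigma)
    g_mu /= T * np.sqrt(weights)[:, None]
    g_var = np.einsum('tk,tkd->kd', gamma, diff**2 / variances - 1.0)
    g_var /= T * np.sqrt(2.0 * weights)[:, None]
    fv = np.concatenate([g_mu.ravel(), g_var.ravel()])
    fv = np.sign(fv) * np.sqrt(np.abs(fv))                   # power normalization
    return fv / (np.linalg.norm(fv) + 1e-12)                 # L2 normalization

def l2_normalize(v):
    return v / (np.linalg.norm(v) + 1e-12)

def early_fusion(text_vec, image_vec, video_vec):
    """Early fusion as plain concatenation of L2-normalized modality vectors."""
    return np.concatenate([l2_normalize(text_vec),
                           l2_normalize(image_vec),
                           l2_normalize(video_vec)])

# Toy data: K=4 components, D=8 descriptor dimensions (assumed sizes).
rng = np.random.default_rng(0)
K, D = 4, 8
weights = np.full(K, 1.0 / K)
means = rng.normal(size=(K, D))
variances = np.ones((K, D))

short_video = rng.normal(size=(5, D))    # 5 frames
long_video = rng.normal(size=(12, D))    # 12 frames
fv_short = fisher_vector(short_video, weights, means, variances)
fv_long = fisher_vector(long_video, weights, means, variances)
print(fv_short.shape, fv_long.shape)     # both have length 2*K*D

fused = early_fusion(rng.normal(size=10), rng.normal(size=16), fv_short)
print(fused.shape)                       # 10 + 16 + 2*K*D entries
```

Both toy videos map to vectors of the same length regardless of their frame counts, which is the property the abstract relies on. The fused vector could then be passed to any regressor to predict a memorability score.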

Roberto Leyva | Faiyaz Doctor | Alba G. Seco de Herrera | Sohail Sahab
