Multimodal Deep Features Fusion for Video Memorability Prediction

This paper describes a multimodal feature fusion approach for predicting short-term and long-term video memorability, where the goal is to design a system that automatically predicts scores reflecting the probability of a video being remembered. The approach performs early fusion of text, image, and video features. Text features are extracted using a Convolutional Neural Network (CNN), image features using an FBResNet152 pre-trained on ImageNet, and video features using a 3DResNet152 pre-trained on Kinetics 400. We use Fisher Vectors to obtain a single fixed-length vector for each video, which avoids the need for a variable-length global representation to handle temporal information. The fusion approach demonstrates good predictive performance and outperforms standard features in regression, as measured by correlation.
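As a rough illustration of the early fusion step, the sketch below concatenates one feature vector per modality into a single vector that a regressor could consume. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names `l2_normalize` and `early_fusion`, the per-modality L2 normalization, and the tiny toy feature dimensions are all hypothetical choices made for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit L2 norm (assumed preprocessing, not stated in the paper)."""
    norm = math.sqrt(sum(x * x for x in vec))
    # Leave all-zero vectors unchanged to avoid division by zero.
    return [x / norm for x in vec] if norm > 0 else list(vec)


def early_fusion(text_feat, image_feat, video_feat):
    """Early fusion: normalize each modality, then concatenate into one vector."""
    fused = []
    for feat in (text_feat, image_feat, video_feat):
        fused.extend(l2_normalize(feat))
    return fused


# Toy per-video features; real deep features would have far more dimensions.
text_feat = [3.0, 4.0]
image_feat = [1.0, 0.0, 0.0]
video_feat = [0.0, 2.0]

fused = early_fusion(text_feat, image_feat, video_feat)
print(len(fused))  # length equals the sum of the three input lengths
```

Normalizing each modality before concatenation keeps one modality from dominating the fused vector purely because of its scale; whether the original system does this is not stated in the abstract.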
