Combining Textual and Visual Modeling for Predicting Media Memorability

This paper describes a multimodal approach proposed by the MeMAD team for the MediaEval 2019 "Predicting Media Memorability" task. Our best approach is a weighted-average method combining predictions made separately from visual and textual representations of videos. In particular, we augmented the provided textual descriptions with automatically generated deep captions. For long-term memorability, we obtained better scores using the short-term predictions rather than the long-term ones. Our best model achieves Spearman scores of 0.522 and 0.277 for the short-term and long-term prediction tasks, respectively.
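The late-fusion step described above can be sketched as a simple weighted average of per-video scores from the two unimodal models. This is a minimal illustration, not the authors' implementation; the function name `fuse_predictions` and the mixing weight `alpha` are assumptions (in practice such a weight would be tuned on validation data).

```python
def fuse_predictions(visual_scores, textual_scores, alpha=0.5):
    """Weighted average of per-video memorability scores from two models.

    `alpha` weights the visual model's prediction; (1 - alpha) weights
    the textual model's prediction. Both inputs are aligned per video.
    """
    return [alpha * v + (1.0 - alpha) * t
            for v, t in zip(visual_scores, textual_scores)]

# Hypothetical per-video scores from the visual and textual models.
visual = [0.80, 0.60, 0.90]
textual = [0.70, 0.65, 0.85]
fused = fuse_predictions(visual, textual, alpha=0.6)
```

The fused scores can then be ranked and evaluated with a rank correlation such as Spearman's rho, the metric used in the task.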
