Multimodal social media video classification with deep neural networks

Classifying videos according to their content is a common task across many contexts, as it enables effective content tagging, indexing and searching. In this work, we propose a general framework for video classification built on top of several neural network architectures. The approach is multimodal: we extract both visual and textual features from videos and combine them in a final classification algorithm. When trained on a dataset of 30 000 social media videos and evaluated on 6 000 videos, our multimodal deep learning algorithm outperforms shallow single-modality classification methods by a margin of up to 95%, achieving an overall accuracy of 88%.
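The idea of combining visual and textual features in a final classifier can be illustrated with a minimal sketch. It assumes feature-level fusion by simple concatenation followed by a linear softmax layer; the abstract does not specify the fusion scheme or the classifier, so this is not the authors' method. The function names, feature dimensions, class count and random weights below are all hypothetical.

```python
import numpy as np

def fuse_features(visual, textual):
    """Concatenate per-video visual and textual feature vectors (assumed fusion scheme)."""
    visual = np.asarray(visual, dtype=float)
    textual = np.asarray(textual, dtype=float)
    if visual.shape[0] != textual.shape[0]:
        raise ValueError("both modalities must describe the same number of videos")
    return np.concatenate([visual, textual], axis=1)

def predict_classes(fused, weights, bias):
    """Apply a linear layer followed by softmax; return probabilities and predicted labels."""
    logits = fused @ weights + bias
    logits = logits - logits.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    exp = np.exp(logits)
    probs = exp / exp.sum(axis=1, keepdims=True)
    return probs, probs.argmax(axis=1)

# Hypothetical data: 4 videos, 8-dim visual and 5-dim textual features, 3 classes.
rng = np.random.default_rng(0)
visual = rng.normal(size=(4, 8))
textual = rng.normal(size=(4, 5))
fused = fuse_features(visual, textual)   # one 13-dim vector per video
weights = rng.normal(size=(13, 3))       # untrained, random weights for illustration only
bias = np.zeros(3)
probs, labels = predict_classes(fused, weights, bias)
```

In practice the visual and textual vectors would come from pretrained networks and the classifier weights would be learned from labelled videos; the sketch only shows how the two modalities meet in a single input to the classifier.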
