A new multimodal deep-learning model to video scene segmentation

Recent developments in deep learning techniques, such as convolutional networks, have shed new light on the video (story) scene segmentation problem and bring the potential to outperform state-of-the-art non-deep-learning multimodal approaches. However, one important aspect of multimodality still needs investigation in the context of deep learning: multimodal fusion. Often, features are fed directly to a network, which may be an inadequate way to perform the underlying multimodal fusion. This paper presents an evaluation of early and late approaches to deep learning multimodal fusion. In addition, it proposes a new deep learning model for video scene segmentation, based on the feature extraction capabilities of convolutional networks and a recurrent neural network architecture. The results show that the early versus late fusion discussion is reopened for deep learning. Moreover, they indicate that the proposed model is competitive with state-of-the-art techniques when evaluated on a public documentary video dataset, obtaining an average FCO of up to 64, while also maintaining a lower computational cost than a related convolutional approach.
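
To make the distinction between the two fusion strategies concrete, the following is a minimal sketch and not the model proposed in the paper. It assumes each shot is described by one visual and one audio feature vector. The function names (`early_fusion`, `late_fusion`, `linear_score`) and the dot-product scorer standing in for a trained network are hypothetical, chosen only for illustration.

```python
from typing import List, Sequence


def early_fusion(visual: Sequence[float], audio: Sequence[float]) -> List[float]:
    """Early fusion: concatenate the modality features into one vector,
    so that a single model sees all modalities at once."""
    return list(visual) + list(audio)


def late_fusion(scores: Sequence[float], weights: Sequence[float]) -> float:
    """Late fusion: each modality is scored by its own model first, and
    the per-modality scores are then combined (here, a weighted average)."""
    if len(scores) != len(weights):
        raise ValueError("scores and weights must have the same length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(s * w for s, w in zip(scores, weights)) / total


def linear_score(features: Sequence[float], coeffs: Sequence[float]) -> float:
    """Hypothetical stand-in for a trained network: a plain dot product."""
    if len(features) != len(coeffs):
        raise ValueError("features and coeffs must have the same length")
    return sum(f * c for f, c in zip(features, coeffs))
```

With early fusion, one scorer is applied to `early_fusion(visual, audio)`. With late fusion, `linear_score` is applied to each modality separately and the two results are passed to `late_fusion`. The difference lies in where the modalities meet: before the model, or after it.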
