A semantic-based video scene segmentation using a deep neural network

Video scene segmentation is an important research topic in computer vision because it supports efficient storage, indexing and retrieval of videos. Scene segmentation cannot be achieved merely by computing the similarity of the low-level features present in a video; high-level features must also be considered to obtain better performance. Although much research has been conducted on video scene segmentation, most existing studies fail to segment a video into scenes semantically. In this study, we therefore propose a Deep-learning Semantic-based Scene-segmentation model (DeepSSS) that uses image captioning to segment a video into scenes semantically. First, DeepSSS performs shot boundary detection by comparing colour histograms and then extracts keyframes using a maximum-entropy criterion. Second, for semantic analysis, it uses deep-learning-based image captioning to generate a text description of each keyframe. Finally, by comparing and analysing the generated texts, it assembles the keyframes into scenes, each grouped under a semantic narrative. DeepSSS thus considers both low- and high-level features of videos to achieve a more meaningful scene segmentation. Using data sets from MS COCO for caption generation and data sets from TRECVid 2016 to evaluate the semantic scene-segmentation task, we demonstrate quantitatively that DeepSSS outperforms existing scene-segmentation methods based on shot boundary detection and keyframes. In addition, we compared scenes segmented by humans with scenes segmented by DeepSSS, and the results verified that DeepSSS's segmentation resembles that of humans. This new kind of result is enabled by semantic analysis and could not be obtained using only the low-level features of videos.
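The three stages described above can be illustrated with a minimal sketch. It is not the DeepSSS implementation: frames are assumed to be flat lists of 8-bit intensity values (a stand-in for colour histograms), captions are assumed to be supplied already by some captioning model, and the word-overlap (Jaccard) comparison, the function names and both thresholds are hypothetical choices made only for illustration.

```python
import math


def histogram(frame, bins=8, max_value=256):
    """Normalised intensity histogram of a frame (flat list of pixel values)."""
    counts = [0] * bins
    for pixel in frame:
        counts[pixel * bins // max_value] += 1
    return [c / len(frame) for c in counts]


def histogram_distance(h1, h2):
    """Half the L1 distance between two normalised histograms (0 to 1)."""
    return 0.5 * sum(abs(a - b) for a, b in zip(h1, h2))


def detect_shots(frames, threshold=0.5):
    """Start a new shot wherever consecutive histograms differ by more than
    the (assumed) threshold. Returns lists of frame indices."""
    if not frames:
        return []
    shots, current = [], [0]
    for i in range(1, len(frames)):
        d = histogram_distance(histogram(frames[i - 1]), histogram(frames[i]))
        if d > threshold:
            shots.append(current)
            current = []
        current.append(i)
    shots.append(current)
    return shots


def entropy(frame, bins=8):
    """Shannon entropy (bits) of the frame's intensity histogram."""
    return -sum(p * math.log2(p) for p in histogram(frame, bins) if p > 0)


def keyframe(frames, shot):
    """Pick the frame index with maximum entropy within one shot."""
    return max(shot, key=lambda i: entropy(frames[i]))


def jaccard(caption_a, caption_b):
    """Word-overlap similarity between two captions (illustrative stand-in
    for the text comparison step; not the paper's measure)."""
    a, b = set(caption_a.lower().split()), set(caption_b.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0


def group_scenes(captions, min_similarity=0.3):
    """Merge consecutive keyframes whose captions are similar enough."""
    if not captions:
        return []
    scenes = [[0]]
    for i in range(1, len(captions)):
        if jaccard(captions[i - 1], captions[i]) >= min_similarity:
            scenes[-1].append(i)
        else:
            scenes.append([i])
    return scenes


# Toy input: two dark frames followed by two bright frames.
frames = [[10] * 16, [10] * 12 + [40] * 4, [200] * 16, [200] * 12 + [250] * 4]
shots = detect_shots(frames)
keyframes = [keyframe(frames, shot) for shot in shots]

# Captions would come from an image-captioning model; these are made up.
captions = ["a man rides a horse", "a man rides a brown horse",
            "a plate of food on a table"]
scenes = group_scenes(captions)
```

In this toy run the histogram jump between the second and third frames marks the only shot boundary, and the two horse captions fall into one scene while the food caption starts another. A real system would replace the caption list with model output and would likely need a more robust text-similarity measure.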
