User Video Summarization Based on Joint Visual and Semantic Affinity Graph

Automatically generating summaries of user-generated videos is useful but challenging. Such videos are unedited and usually consist of a single long shot, which makes traditional temporal segmentation methods such as shot boundary detection ineffective at producing meaningful segments for summarization. To address this issue, we propose a novel temporal segmentation framework based on clustering a joint visual and semantic affinity graph of the video frames. Using a pre-trained deep convolutional neural network (CNN), we extract deep visual features of the frames to construct the visual affinity graph. We then construct the semantic affinity graph from word embeddings of the frames' semantic tags, produced by an automatic image tagging algorithm. A dense-neighbor method then clusters the joint visual and semantic affinity graph to divide the video into sub-shot-level segments, from which a summary of the video is generated. Experimental results show that our approach outperforms state-of-the-art methods and produces summaries comparable to those created manually.
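The joint graph construction described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes cosine similarity as the affinity measure and a hypothetical mixing weight `alpha` for combining the two graphs; the CNN features and tag embeddings are stand-ins for the real extracted vectors.

```python
import numpy as np

def cosine_affinity(X):
    """Pairwise cosine-similarity affinity matrix for row vectors in X."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    return Xn @ Xn.T

def joint_affinity(visual_feats, tag_embeds, alpha=0.5):
    """Weighted combination of the visual and semantic affinity graphs.

    visual_feats: (n_frames, d_v) deep CNN features, one row per frame.
    tag_embeds:   (n_frames, d_s) word-embedding vectors of frame tags.
    alpha:        visual-graph weight (hypothetical parameter, not from the paper).
    """
    W_v = cosine_affinity(visual_feats)   # visual affinity graph
    W_s = cosine_affinity(tag_embeds)     # semantic affinity graph
    return alpha * W_v + (1.0 - alpha) * W_s

# Toy example: 4 frames with random 3-D "visual" and 2-D "semantic" vectors.
rng = np.random.default_rng(0)
W = joint_affinity(rng.normal(size=(4, 3)), rng.normal(size=(4, 2)))
print(W.shape)  # (4, 4)
```

The resulting symmetric matrix `W` would then be handed to a graph-clustering routine (a dense-neighbor method in the paper) to obtain sub-shot segments.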