A novel feature fusion based framework for efficient shot indexing to massive web videos

This study presents an automatic approach to analyzing the structure of large-scale web videos based on visual and acoustic information. In our approach, video streams are macro-segmented by mining duplicate sequences. Both acoustic and visual information are used for mining so that true positives are not missed. Unlike TV data, in which duplicate clips are very similar, web videos contain severe visual and acoustic distortions; we therefore present novel visual-acoustic feature schemes to handle these distortions. A shot-based indexing algorithm and several temporal constraints are proposed to mine the duplicate sequences: weak geometric verification is combined with direct hashing to achieve high efficiency and superior performance in image-based duplicate sequence detection, and dynamic programming is introduced to recall missed true positives in the audio-based part. Experiments on a dataset of 500 hours of videos with unknown content show that the F-measure of duplicate sequence mining for web videos reaches 95%, and that the proposed algorithm outperforms state-of-the-art approaches in both efficiency and detection performance.
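The abstract does not specify the features or the hash layout, so the following is only a minimal sketch of how direct hashing can be combined with weak geometric verification, in the spirit of weak geometric consistency (WGC) from the image-retrieval literature. It assumes each keyframe of a shot has already been reduced to local features given as a quantized visual-word ID plus an orientation (radians) and a scale. The class `ShotIndex`, the bin sizes, and `min_votes` are hypothetical choices, not the authors' parameters.

```python
import math
from collections import defaultdict

# Assumed feature layout: (visual_word_id, orientation_radians, scale)
ANGLE_BINS = 8          # hypothetical number of rotation bins
SCALE_BIN_WIDTH = 0.5   # hypothetical width of log2 scale-ratio bins


class ShotIndex:
    """Inverted index keyed directly by visual word (the 'direct hash')."""

    def __init__(self):
        # visual word -> list of (shot_id, orientation, scale)
        self.table = defaultdict(list)

    def add_shot(self, shot_id, features):
        for word, angle, scale in features:
            self.table[word].append((shot_id, angle, scale))

    def query(self, features, min_votes=3):
        # Per candidate shot: vote histograms over rotation and scale differences
        angle_hist = defaultdict(lambda: defaultdict(int))
        scale_hist = defaultdict(lambda: defaultdict(int))
        for word, q_angle, q_scale in features:
            # .get() avoids inserting empty lists for unseen words
            for shot_id, angle, scale in self.table.get(word, ()):
                d_angle = (angle - q_angle) % (2 * math.pi)
                a_bin = int(d_angle / (2 * math.pi) * ANGLE_BINS) % ANGLE_BINS
                s_bin = math.floor(math.log2(scale / q_scale) / SCALE_BIN_WIDTH)
                angle_hist[shot_id][a_bin] += 1
                scale_hist[shot_id][s_bin] += 1

        scores = {}
        for shot_id in angle_hist:
            # Weak geometric score: matches agreeing with the dominant
            # rotation and the dominant scale change of this shot
            score = min(max(angle_hist[shot_id].values()),
                        max(scale_hist[shot_id].values()))
            if score >= min_votes:
                scores[shot_id] = score
        # Best-supported candidate shots first
        return sorted(scores.items(), key=lambda kv: -kv[1])
```

A true duplicate shot tends to put most of its word matches into one rotation bin and one scale bin, whereas accidental word collisions scatter across bins, so this cheap histogram check prunes false candidates without fitting a full geometric model.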
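The abstract also does not say which dynamic-programming formulation is used on the audio side. As an illustrative assumption only, the sketch below runs Smith-Waterman-style local alignment over sequences of 32-bit audio sub-fingerprints, treating two frames as matching when their Hamming distance is at most `max_bits`. A single distorted frame then costs a small mismatch penalty instead of breaking the duplicate run. The function names `hamming` and `align_audio`, the scoring values, and the threshold are hypothetical.

```python
def hamming(a: int, b: int) -> int:
    """Number of differing bits between two non-negative sub-fingerprints."""
    return bin(a ^ b).count("1")


def align_audio(x, y, max_bits=8, match=2, mismatch=-1, gap=-1):
    """Local alignment of two sub-fingerprint sequences.

    Returns (score, (x_start, x_end), (y_start, y_end)) with half-open ranges.
    """
    n, m = len(x), len(y)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)

    def sim(i, j):
        # Frames match if they differ in at most max_bits bits
        return match if hamming(x[i - 1], y[j - 1]) <= max_bits else mismatch

    # Fill the DP table; scores are clamped at 0 so alignments are local
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            H[i][j] = max(0, H[i - 1][j - 1] + sim(i, j),
                          H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until the score drops to 0
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0:
        if H[i][j] == H[i - 1][j - 1] + sim(i, j):
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    return best, (i, best_pos[0]), (j, best_pos[1])
```

For example, with a reference run `[A, B, C, D, E, F]` and a query in which `C` has two flipped bits and `D` is replaced by an unrelated frame, the alignment still spans the whole reference run rather than splitting at the corrupted frame.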
