A spatio-temporal pyramid matching for video retrieval

Highlights
• We introduce a content-based video retrieval system that takes a video shot as the query.
• Shot boundaries are found using a classifier learned with a boosting algorithm.
• The similarity of video shots is calculated by spatio-temporal pyramid matching.
• The pyramid-matching kernel incorporates the temporal dimension into the matching scheme.
• Experiments on sports videos and UCF50 show the effectiveness of our method.

An efficient video retrieval system is essential for finding relevant content in a large collection of heterogeneous video clips. In this paper, we introduce a content-based video matching system that finds the most relevant video segments in a video database for a given query clip. Finding relevant video clips is not a trivial task, because objects in a video clip can move constantly over time. To perform this task efficiently, we propose a novel video matching method called Spatio-Temporal Pyramid Matching (STPM). Considering features of objects in 2D space and time, STPM recursively divides a video clip into a 3D spatio-temporal pyramidal space and compares the features at different resolutions. To improve retrieval performance, we consider both static and dynamic features of objects. We also provide a sufficient condition under which the matching gains additional benefit from temporal information. The experimental results show that STPM performs better than the other video matching methods we compared.
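As a rough illustration of the idea of dividing a clip into a 3D spatio-temporal pyramid and comparing features at several resolutions, the following minimal sketch bins quantized local features by (x, y, t) cell at each pyramid level and sums weighted histogram intersections. It is a sketch under stated assumptions, not the authors' implementation: the function names, the normalization of coordinates to [0, 1), the use of visual-word ids, and the level weights (borrowed from the standard spatial pyramid matching formulation) are all assumptions made for illustration.

```python
import numpy as np

def st_histogram(points, words, vocab_size, level):
    """Histogram of visual words per spatio-temporal cell at one pyramid level.

    points: (N, 3) array of (x, y, t) coordinates, assumed normalized to [0, 1).
    words:  (N,) integer visual-word ids in [0, vocab_size).
    """
    points = np.asarray(points, dtype=float)
    words = np.asarray(words, dtype=int)
    cells = 2 ** level  # cells per axis: 1, 2, 4, ...
    idx = np.minimum((points * cells).astype(int), cells - 1)
    flat = (idx[:, 0] * cells + idx[:, 1]) * cells + idx[:, 2]
    hist = np.zeros((cells ** 3, vocab_size))
    np.add.at(hist, (flat, words), 1)  # unbuffered add handles repeated indices
    return hist

def stpm_kernel(points_a, words_a, points_b, words_b, vocab_size, levels=2):
    """Weighted sum of histogram intersections over pyramid levels 0..levels.

    Assumed weights (spatial-pyramid style): 1/2^L for level 0 and
    1/2^(L-l+1) for level l >= 1, so finer levels count more.
    """
    inter = []
    for level in range(levels + 1):
        ha = st_histogram(points_a, words_a, vocab_size, level)
        hb = st_histogram(points_b, words_b, vocab_size, level)
        inter.append(np.minimum(ha, hb).sum())
    score = inter[0] / 2 ** levels
    for level in range(1, levels + 1):
        score += inter[level] / 2 ** (levels - level + 1)
    return float(score)

# Two single-feature clips: same visual word and position, different time.
early = (np.array([[0.1, 0.1, 0.1]]), np.array([0]))
late = (np.array([[0.1, 0.1, 0.9]]), np.array([0]))
same_score = stpm_kernel(*early, *early, vocab_size=4)
shifted_score = stpm_kernel(*early, *late, vocab_size=4)
```

With these weights, a clip matched against itself scores its feature count, while the time-shifted pair matches only at the coarsest level. A purely spatial pyramid would not separate the two clips, which is the kind of benefit the temporal dimension is meant to provide.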
