Semantic Concept Detection based on Spatial Pyramid Matching and Semi-supervised Learning

Analyzing video for semantic content is essential for finding a desired video within large collections of accumulated video data. A conventional method for detecting objects depicted in video is the bag-of-visual-words method, which is based on the occurrence frequencies of local features. We propose a method that improves on the detection accuracy of the traditional approach by dividing video frames into overlapping sub-regions of various sizes. The method computes local and global features for each of these sub-regions so that the feature vectors reflect spatial position. These changes make the method robust to variations in the size and position of objects appearing in the video. We also propose a training framework based on semi-supervised learning that starts from a small number of labeled data points and efficiently generates additional labeled training data with few errors. Experiments on a video data set confirmed improved detection accuracy over earlier methods.
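The two ideas in the abstract can be sketched in code. The first function builds a spatial-pyramid bag-of-visual-words vector: the frame is split into a grid at each pyramid level and a visual-word histogram is built per cell, so the concatenated vector encodes where words occur, not only how often. This is a minimal, non-overlapping pyramid in the style of standard spatial pyramid matching; the paper's variant uses overlapping sub-regions of various sizes, which would correspond to sliding the windows with a stride smaller than the window size. The second function is a generic self-training loop, one common semi-supervised scheme consistent with the abstract's description. All names, parameters, and the confidence threshold are illustrative, not taken from the paper.

```python
def cell_histogram(keypoints, vocab_size, x0, y0, x1, y1):
    """Visual-word occurrence counts for keypoints inside one sub-region."""
    hist = [0] * vocab_size
    for x, y, word in keypoints:
        if x0 <= x < x1 and y0 <= y < y1:
            hist[word] += 1
    return hist

def build_pyramid_histogram(keypoints, width, height, vocab_size, levels=3):
    """Concatenate per-cell histograms over a 1x1, 2x2, 4x4, ... grid.

    keypoints: iterable of (x, y, visual_word_id) tuples for one frame.
    Returns a feature vector of length vocab_size * (1 + 4 + 16 + ...).
    """
    feature = []
    for level in range(levels):
        cells = 2 ** level                      # grid is cells x cells
        cw, ch = width / cells, height / cells  # cell width and height
        for i in range(cells):
            for j in range(cells):
                feature.extend(cell_histogram(
                    keypoints, vocab_size,
                    i * cw, j * ch, (i + 1) * cw, (j + 1) * ch))
    return feature

def self_train(labeled, unlabeled, train_fn, predict_fn,
               threshold=0.9, rounds=5):
    """Self-training sketch: grow the labeled set from confident predictions.

    labeled: list of (sample, label); unlabeled: list of samples.
    predict_fn(model, x) must return (label, confidence).
    Stops early when no unlabeled sample clears the confidence threshold.
    """
    model = train_fn(labeled)
    for _ in range(rounds):
        model = train_fn(labeled)
        still_unlabeled = []
        for x in unlabeled:
            label, confidence = predict_fn(model, x)
            if confidence >= threshold:
                labeled.append((x, label))   # promote confident prediction
            else:
                still_unlabeled.append(x)
        if len(still_unlabeled) == len(unlabeled):
            break                            # nothing new was labeled
        unlabeled = still_unlabeled
    return model, labeled
```

For example, with a vocabulary of two visual words and two pyramid levels, a keypoint near the top-left and one near the bottom-right of a 100x100 frame land in different level-1 cells, so the resulting 10-dimensional vector distinguishes their positions even though the level-0 histograms are identical.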
