Discovering compact topical descriptors for web video retrieval

Describing videos efficiently is an important task for content-based web video retrieval. To address it, we propose an unsupervised approach, based on an undirected topic model, that learns a compact topical descriptor on top of the bag-of-words (BoW) video representation. In our method, each word in a BoW is assumed to have its own topic features, and the topical descriptor of an entire video is obtained by aggregating those features, so that the descriptor captures the relative strength of topics. To improve the interpretability of the descriptor, an L1 penalty is used to control topical sparsity. Furthermore, efficient learning and inference algorithms are presented. We evaluate the proposed descriptor on the Columbia Consumer Video dataset. Experimental results demonstrate that the proposed compact descriptor outperforms the BoW and other topical representations in web video retrieval.
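The aggregation idea can be illustrated with a minimal sketch. This is not the undirected topic model proposed in the paper: the per-word topic features are assumed to be given rather than learned, the function names `soft_threshold` and `topical_descriptor` are hypothetical, and soft-thresholding is used here only as a simple stand-in for the effect of an L1 sparsity penalty.

```python
def soft_threshold(x, lam):
    """Proximal operator of the L1 penalty: shrink x toward zero by lam."""
    if x > lam:
        return x - lam
    if x < -lam:
        return x + lam
    return 0.0


def topical_descriptor(bow_counts, word_topic_features, lam=0.0):
    """Aggregate per-word topic features into one descriptor per video.

    bow_counts: word counts, one per vocabulary entry.
    word_topic_features: one list of K topic weights per vocabulary entry
        (assumed given here; a real model would learn them).
    lam: shrinkage strength, an illustrative stand-in for L1 sparsity.
    """
    num_topics = len(word_topic_features[0])
    total = sum(bow_counts)
    descriptor = [0.0] * num_topics
    if total == 0:
        return descriptor  # empty video: all-zero descriptor
    # Count-weighted sum of the topic features of every word in the video.
    for count, features in zip(bow_counts, word_topic_features):
        for k in range(num_topics):
            descriptor[k] += count * features[k]
    # Normalize by video length, then shrink weak topics to exactly zero.
    return [soft_threshold(v / total, lam) for v in descriptor]


# Toy example: 3-word vocabulary, 2 topics.
bow = [3, 1, 0]
features = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
dense = topical_descriptor(bow, features)             # [0.75, 0.25]
sparse = topical_descriptor(bow, features, lam=0.25)  # [0.5, 0.0]
```

In this toy case the descriptor has only as many dimensions as there are topics, and the shrinkage step removes the weaker topic entirely, which is the kind of compactness and sparsity the abstract refers to.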
