Modeling correlation between multi-modal continuous words for pLSA-based video classification

Because probabilistic Latent Semantic Analysis (pLSA) handles only discrete quantities, pLSA with Gaussian Mixtures (GM-pLSA) extends it to continuous feature spaces by treating each continuous feature as a continuous word. However, GM-pLSA offers no clear way to model multi-modal features, and it neglects the intrinsic correlation between these continuous words. In this paper, we present a graph regularized multi-modal GM-pLSA (GRMMGM-pLSA) model that incorporates the correlation between multi-modal continuous words into model learning. First, multiple GMMs are adopted, each depicting the distribution of continuous words from one modality; then, a graph regularizer is introduced to capture the word correlation. For video classification, GRMMGM-pLSA performs feature mapping using both the multi-modal visual features of sub-shots and the word correlation given by temporal consistency between sub-shots. Experiments on YouTube videos show the effectiveness of the proposed model.
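To make the role of the graph regularizer concrete, the following is a minimal sketch and not the formulation used in the paper. It assumes that temporally adjacent sub-shots are linked by an unweighted edge and that smoothness is measured by the squared Euclidean distance between their topic posteriors, which gives the standard Laplacian penalty trace(P^T L P). The function names `temporal_adjacency` and `graph_regularizer`, the `window` parameter, and the choice of distance are all assumptions made for illustration.

```python
import numpy as np


def temporal_adjacency(n_subshots, window=1):
    """Hypothetical graph: W[i, j] = 1 when sub-shots i and j are at most
    `window` positions apart in the video (and i != j), else 0."""
    idx = np.arange(n_subshots)
    dist = np.abs(idx[:, None] - idx[None, :])
    return ((dist > 0) & (dist <= window)).astype(float)


def graph_regularizer(posteriors, W):
    """Smoothness penalty 0.5 * sum_ij W[i, j] * ||p_i - p_j||^2.

    `posteriors` has one row per sub-shot (continuous word) and one column
    per latent topic. For a symmetric W the sum equals trace(P^T L P),
    where L = D - W is the unnormalized graph Laplacian.
    """
    laplacian = np.diag(W.sum(axis=1)) - W
    return float(np.trace(posteriors.T @ laplacian @ posteriors))


# Four consecutive sub-shots, two latent topics.
W = temporal_adjacency(4)
smooth = np.full((4, 2), 0.5)                      # identical posteriors
jumpy = np.array([[1.0, 0.0], [0.0, 1.0]] * 2)     # topic flips every sub-shot

print(graph_regularizer(smooth, W))
print(graph_regularizer(jumpy, W))
```

Under these assumptions the penalty is zero when neighbouring sub-shots share the same topic posterior and grows as the posteriors of neighbours diverge, so adding it to the learning objective favours temporally consistent topic assignments.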