Spectral learning of latent semantics for action recognition

This paper proposes novel spectral methods for learning latent semantics (i.e. high-level features) from a large vocabulary of abundant mid-level features (i.e. visual keywords), which helps to bridge the semantic gap in the challenging task of action recognition. To discover the manifold structure hidden among mid-level features, we develop spectral embedding approaches based on graphs and hypergraphs that require no parameter tuning for graph construction, a key step of manifold learning. In particular, the traditional graphs are constructed by linear reconstruction with sparse coding. In the new embedding space, we learn high-level latent semantics automatically from the abundant mid-level features through spectral clustering. The learnt latent semantics can be readily used for action recognition with an SVM by defining a histogram intersection kernel. Unlike traditional latent semantic analysis based on topic models, our two spectral methods for semantic learning can discover the manifold structure hidden among mid-level features, which results in compact but discriminative high-level features. Experimental results on two standard action datasets show the superior performance of our spectral methods.
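The pipeline described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. It assumes an affinity matrix W over the mid-level features is already available (the paper builds it by sparse-coding reconstruction, which is not reproduced here), and it uses a standard normalized-Laplacian embedding followed by a simple k-means. All function names are hypothetical.

```python
import numpy as np

def spectral_embedding(W, dim):
    """Embed graph nodes with the `dim` smallest eigenvectors of the
    normalized Laplacian (assumption: W is a given affinity matrix)."""
    W = (W + W.T) / 2.0                      # symmetrize the affinities
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    L = np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(L)              # eigenvalues in ascending order
    E = vecs[:, :dim]
    norms = np.linalg.norm(E, axis=1, keepdims=True)
    return E / np.maximum(norms, 1e-12)      # row-normalize the embedding

def kmeans(X, k, iters=50):
    """Plain k-means with deterministic farthest-point initialization."""
    centers = [X[0]]
    for _ in range(1, k):
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[int(d2.argmax())])
    centers = np.array(centers, dtype=float)
    for _ in range(iters):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def pool_histograms(H_mid, labels, k):
    """Merge mid-level histogram bins that fall in the same cluster,
    giving one compact high-level histogram per video."""
    H_high = np.zeros((H_mid.shape[0], k))
    for j in range(k):
        H_high[:, j] = H_mid[:, labels == j].sum(axis=1)
    return H_high

def histogram_intersection_kernel(A, B):
    """K[i, j] = sum over bins of min(A[i, bin], B[j, bin])."""
    return np.minimum(A[:, None, :], B[None, :, :]).sum(axis=2)
```

The resulting kernel matrix could then be passed to any SVM implementation that accepts a precomputed kernel; that step is omitted here.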
