Semi-Supervised Multiple Feature Analysis for Action Recognition

This paper presents a semi-supervised method for categorizing human actions using multiple visual features. The proposed algorithm simultaneously learns multiple features from a small number of labeled videos, and automatically utilizes data distributions between labeled and unlabeled data to boost the recognition performance. Shared structural analysis is applied in our approach to discover a common subspace shared by each type of feature. In the subspace, the proposed algorithm is able to characterize more discriminative information of each feature type. Additionally, data distribution information of each type of feature has been preserved. The aforementioned attributes make our algorithm robust for action recognition, especially when only limited labeled training samples are provided. Extensive experiments have been conducted on both the choreographed and the realistic video datasets, including KTH, Youtube action and UCF50. Experimental results show that our method outperforms several state-of-the-art algorithms. Most notably, much better performances have been achieved when there are only a few labeled training samples.

[1]  Feiping Nie,et al.  Efficient and Robust Feature Selection via Joint ℓ2, 1-Norms Minimization , 2010, NIPS.

[2]  Zi Huang,et al.  Multi-Feature Fusion via Hierarchical Regression for Multimedia Analysis , 2013, IEEE Transactions on Multimedia.

[3]  Ivan Laptev,et al.  On Space-Time Interest Points , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[4]  Liang Wang,et al.  Informative Shape Representations for Human Action Recognition , 2006, 18th International Conference on Pattern Recognition (ICPR'06).

[5]  Alexander G. Hauptmann,et al.  MoSIFT: Recognizing Human Actions in Surveillance Videos , 2009 .

[6]  Yi Yang,et al.  Action recognition by exploring data distribution and feature correlation , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[7]  Xiaojin Zhu,et al.  Semi-Supervised Learning Literature Survey , 2005 .

[8]  Zhiwu Lu,et al.  Combining Context, Consistency, and Diversity Cues for Interactive Image Categorization , 2010, IEEE Transactions on Multimedia.

[9]  Steven C. H. Hoi,et al.  Two-View Transductive Support Vector Machines , 2010, SDM.

[10]  Cordelia Schmid,et al.  Evaluation of Local Spatio-temporal Features for Action Recognition , 2009, BMVC.

[11]  Yi Yang,et al.  Harry Potter's Marauder's Map: Localizing and Tracking Multiple Persons-of-Interest by Nonnegative Discretization , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[12]  Tong Zhang,et al.  A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data , 2005, J. Mach. Learn. Res..

[13]  Barbara Caputo,et al.  Recognizing human actions: a local SVM approach , 2004, Proceedings of the 17th International Conference on Pattern Recognition, 2004. ICPR 2004..

[14]  Yuan Shi,et al.  Geodesic flow kernel for unsupervised domain adaptation , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[15]  Jintao Li,et al.  Hierarchical spatio-temporal context modeling for action recognition , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[16]  John Shawe-Taylor,et al.  Two view learning: SVM-2K, Theory and Practice , 2005, NIPS.

[17]  Jiebo Luo,et al.  Recognizing realistic actions from videos “in the wild” , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[18]  Nicu Sebe,et al.  Web Image Annotation Via Subspace-Sparsity Collaborated Feature Selection , 2012, IEEE Transactions on Multimedia.

[19]  Shuang Wu,et al.  Multimodal feature fusion for robust event detection in web videos , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[20]  Yi Yang,et al.  A Multimedia Retrieval Framework Based on Semi-Supervised Ranking and Relevance Feedback , 2012, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[21]  Michael R. Lyu,et al.  A Multimodal and Multilevel Ranking Scheme for Large-Scale Video Retrieval , 2008, IEEE Transactions on Multimedia.

[22]  Cordelia Schmid,et al.  Learning realistic human actions from movies , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[23]  Jieping Ye,et al.  A shared-subspace learning framework for multi-label classification , 2010, TKDD.

[24]  Gene H. Golub,et al.  Matrix computations (3rd ed.) , 1996 .

[25]  Xinghua Sun,et al.  Action recognition via local descriptors and holistic features , 2009, 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops.

[26]  Rama Chellappa,et al.  Cross-View Action Recognition via a Transferable Dictionary Pair , 2012, BMVC.

[27]  Nuno Vasconcelos,et al.  TaylorBoost: First and second-order boosting algorithms with explicit margin control , 2011, CVPR 2011.

[28]  Rama Chellappa,et al.  Domain adaptation for object recognition: An unsupervised approach , 2011, 2011 International Conference on Computer Vision.

[29]  Yang Wang,et al.  Max-margin hidden conditional random fields for human action recognition , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[30]  Andrew Zisserman,et al.  Efficient additive kernels via explicit feature maps , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[31]  Yi Yang,et al.  Web and Personal Image Annotation by Mining Label Correlation With Relaxed Visual Graph Embedding , 2012, IEEE Transactions on Image Processing.

[32]  Nicu Sebe,et al.  Exploiting the entire feature space with sparsity for automatic image annotation , 2011, ACM Multimedia.

[33]  Bernhard Schölkopf,et al.  Learning with Local and Global Consistency , 2003, NIPS.

[34]  Yueting Zhuang,et al.  Adaptive Unsupervised Multi-view Feature Selection for Visual Concept Recognition , 2012, ACCV.

[35]  Changsheng Xu,et al.  A Generic Framework for Video Annotation via Semi-Supervised Learning , 2012, IEEE Transactions on Multimedia.

[36]  Edmond Boyer,et al.  Action recognition using exemplar-based embedding , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[37]  Ronald Poppe,et al.  A survey on vision-based human action recognition , 2010, Image Vis. Comput..