Semi-Supervised Image-to-Video Adaptation for Video Action Recognition

Human action recognition has been well explored in applications of computer vision. Many successful action recognition methods have shown that action knowledge can be effectively learned from motion videos or still images. For the same action, the appropriate action knowledge learned from different types of media, e.g., videos or images, may be related. However, less effort has been made to improve the performance of action recognition in videos by adapting the action knowledge conveyed from images to videos. Most of the existing video action recognition methods suffer from the problem of lacking sufficient labeled training videos. In such cases, over-fitting would be a potential problem and the performance of action recognition is restrained. In this paper, we propose an adaptation method to enhance action recognition in videos by adapting knowledge from images. The adapted knowledge is utilized to learn the correlated action semantics by exploring the common components of both labeled videos and images. Meanwhile, we extend the adaptation method to a semi-supervised framework which can leverage both labeled and unlabeled videos. Thus, the over-fitting can be alleviated and the performance of action recognition is improved. Experiments on public benchmark datasets and real-world datasets show that our method outperforms several other state-of-the-art action recognition methods.

[1]  S. A. Angadi,et al.  Shot Boundary Detection and Key Frame Extraction for Sports Video Summarization Based on Spectral Entropy and Mutual Information , 2013 .

[2]  Václav Hlavác,et al.  Pose primitive based human action recognition in videos or still images , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[3]  Ling Shao,et al.  Discriminative Tracking Using Tensor Pooling , 2016, IEEE Transactions on Cybernetics.

[4]  Cordelia Schmid,et al.  Dense Trajectories and Motion Boundary Descriptors for Action Recognition , 2013, International Journal of Computer Vision.

[5]  Ivor W. Tsang,et al.  Learning with Augmented Features for Heterogeneous Domain Adaptation , 2012, ICML.

[6]  Wei Liu,et al.  Large Graph Construction for Scalable Semi-Supervised Learning , 2010, ICML.

[7]  Ling Shao,et al.  Structure-Preserving Binary Representations for RGB-D Action Recognition , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[8]  Avishek Saha,et al.  Co-regularization Based Semi-supervised Domain Adaptation , 2010, NIPS.

[9]  Yi Yang,et al.  Semisupervised Feature Selection via Spline Regression for Video Semantic Recognition , 2015, IEEE Transactions on Neural Networks and Learning Systems.

[10]  Mubarak Shah,et al.  UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild , 2012, ArXiv.

[11]  Ling Shao,et al.  Learning Spatio-Temporal Representations for Action Recognition: A Genetic Programming Approach , 2016, IEEE Transactions on Cybernetics.

[12]  Alexander G. Hauptmann,et al.  MoSIFT: Recognizing Human Actions in Surveillance Videos , 2009 .

[13]  Changyin Sun,et al.  Action Recognition Using Nonnegative Action Component Representation and Sparse Basis Selection , 2014, IEEE Transactions on Image Processing.

[14]  Larry S. Davis,et al.  Observing Human-Object Interactions: Using Spatial and Functional Compatibility for Recognition , 2009, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[15]  Yonina C. Eldar,et al.  Average Case Analysis of Multichannel Sparse Recovery Using Convex Relaxation , 2009, IEEE Transactions on Information Theory.

[16]  Yi Yang,et al.  Image Attribute Adaptation , 2014, IEEE Transactions on Multimedia.

[17]  Chris H. Q. Ding,et al.  R1-PCA: rotational invariant L1-norm principal component analysis for robust subspace factorization , 2006, ICML.

[18]  Ling Shao,et al.  Learning Discriminative Key Poses for Action Recognition , 2013, IEEE Transactions on Cybernetics.

[19]  Ling Shao,et al.  Kernelized Multiview Projection for Robust Action Recognition , 2016, International Journal of Computer Vision.

[20]  Cordelia Schmid,et al.  Evaluation of Local Spatio-temporal Features for Action Recognition , 2009, BMVC.

[21]  Bernhard Schölkopf,et al.  Nonlinear Component Analysis as a Kernel Eigenvalue Problem , 1998, Neural Computation.

[22]  Xiaojin Zhu,et al.  --1 CONTENTS , 2006 .

[23]  Chih-Jen Lin,et al.  LIBSVM: A library for support vector machines , 2011, TIST.

[24]  Fei-Fei Li,et al.  Grouplet: A structured image representation for recognizing human and object interactions , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[25]  Koen E. A. van de Sande,et al.  Evaluating Color Descriptors for Object and Scene Recognition , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[26]  Yunde Jia,et al.  Transfer Latent SVM for Joint Recognition and Localization of Actions in Videos , 2016, IEEE Transactions on Cybernetics.

[27]  Barbara Caputo,et al.  Learning to Learn, from Transfer Learning to Domain Adaptation: A Unifying Perspective , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[28]  Yi Yang,et al.  A Multimedia Retrieval Framework Based on Semi-Supervised Ranking and Relevance Feedback , 2012, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[29]  Ching-Yung Lin,et al.  Video Collaborative Annotation Forum: Establishing Ground-Truth Labels on Large Multimedia Datasets , 2003, TRECVID.

[30]  Jitendra Malik,et al.  Shape matching and object recognition using shape contexts , 2010, 2010 3rd International Conference on Computer Science and Information Technology.

[31]  David Windridge,et al.  Multilevel Chinese Takeaway Process and Label-Based Processes for Rule Induction in the Context of Automated Sports Video Annotation , 2014, IEEE Transactions on Cybernetics.

[32]  Trevor Darrell,et al.  Efficient Learning of Domain-invariant Image Representations , 2013, ICLR.

[33]  Zi Huang,et al.  Transfer tagging from image to video , 2011, ACM Multimedia.

[34]  Rong Yan,et al.  Cross-domain video concept detection using adaptive svms , 2007, ACM Multimedia.

[35]  Nicu Sebe,et al.  Thinking of Images as What They Are: Compound Matrix Regression for Image Classification , 2013, IJCAI.

[36]  Barbara Caputo,et al.  Multiclass transfer learning from unconstrained priors , 2011, 2011 International Conference on Computer Vision.

[37]  Jiawei Han,et al.  Semi-supervised Discriminant Analysis , 2007, 2007 IEEE 11th International Conference on Computer Vision.

[38]  Ivan Laptev,et al.  Recognizing human actions in still images: a study of bag-of-features and part-based representations , 2010, BMVC.

[39]  Ling Shao,et al.  Spatio-Temporal Laplacian Pyramid Coding for Action Recognition , 2014, IEEE Transactions on Cybernetics.

[40]  Ivor W. Tsang,et al.  Visual Event Recognition in Videos by Learning from Web Data , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[41]  G LoweDavid,et al.  Distinctive Image Features from Scale-Invariant Keypoints , 2004 .

[42]  Feiping Nie,et al.  Efficient and Robust Feature Selection via Joint ℓ2, 1-Norms Minimization , 2010, NIPS.

[43]  Ling Shao,et al.  Weakly-Supervised Cross-Domain Dictionary Learning for Visual Recognition , 2014, International Journal of Computer Vision.

[44]  Jianxin Wu,et al.  Good Practices for Learning to Recognize Actions Using FV and VLAD , 2016, IEEE Transactions on Cybernetics.

[45]  Yueting Zhuang,et al.  Cross-media semantic representation via bi-directional learning to rank , 2013, ACM Multimedia.

[46]  Pinar Duygulu Sahin,et al.  Recognizing actions from still images , 2008, 2008 19th International Conference on Pattern Recognition.

[47]  Ling Shao,et al.  Transfer Learning for Visual Categorization: A Survey , 2015, IEEE Transactions on Neural Networks and Learning Systems.

[48]  Cordelia Schmid,et al.  Aggregating local descriptors into a compact image representation , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[49]  Dacheng Tao,et al.  This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination. IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS—PART B: CYBERNETICS 1 Cross-Domain Human Action Recognition , 2022 .

[50]  Leonidas J. Guibas,et al.  Human action recognition by learning bases of action attributes and parts , 2011, 2011 International Conference on Computer Vision.

[51]  Zihan Zhou,et al.  Label Information Guided Graph Construction for Semi-Supervised Learning , 2017, IEEE Transactions on Image Processing.