Discriminative Multi-View Subspace Feature Learning for Action Recognition

Although deep features have recently achieved state-of-the-art performance in action recognition, hand-crafted shallow features still play a critical role in characterizing human actions because they capture visual content, such as edges, in an intuitive way. Shallow features can therefore serve as auxiliary visual cues that complement deep representations. In this paper, we propose a discriminative subspace learning model (DSLM) to explore the complementary properties of hand-crafted shallow representations and deep features. To the best of our knowledge, this is the first work on RGB action recognition that mines multi-level feature complementarity through a multi-view subspace learning scheme. To fully capture the complementary information among heterogeneous features, we construct the DSLM by integrating the multi-view reconstruction error and the classification error into a unified objective function. Specifically, we first encode improved dense trajectories with Fisher vectors (iDT+FV) to obtain shallow representations and use two-stream convolutional neural network models (T-CNN) to generate deep features. The proposed DSLM then projects these multi-level features onto a shared discriminative subspace, incorporating complementary information and discriminative capacity simultaneously. Finally, the action classes of test samples are identified by the margins from the learned compact representations to the decision boundary. Experimental results on three datasets demonstrate the effectiveness of the proposed method.
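The abstract describes a joint objective combining per-view reconstruction error with a classification error over a shared subspace. As a minimal sketch of this family of models (not the paper's exact formulation — the loss form, the alternating closed-form updates, and all dimensions below are illustrative assumptions), one can minimize $\sum_v \|X_v - P_v Z\|_F^2 + \lambda \|Y - W Z\|_F^2$ over per-view bases $P_v$, a shared representation $Z$, and a linear classifier $W$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for two heterogeneous views of n samples, e.g. a shallow
# iDT+FV view and a deep two-stream CNN view (dimensions are arbitrary).
n, d1, d2, k, c = 40, 20, 30, 5, 3
X = [rng.standard_normal((d1, n)), rng.standard_normal((d2, n))]
Y = np.eye(c)[rng.integers(0, c, n)].T   # one-hot labels, shape (c, n)
lam = 1.0                                # assumed weight on the classification term

def objective(X, Y, P, W, Z, lam):
    rec = sum(np.linalg.norm(Xv - Pv @ Z) ** 2 for Xv, Pv in zip(X, P))
    return rec + lam * np.linalg.norm(Y - W @ Z) ** 2

# Shared k-dimensional subspace representation Z, per-view projections P_v,
# and classifier W, refined by alternating least-squares (each update solves
# its subproblem exactly, so the objective decreases monotonically).
Z = rng.standard_normal((k, n))
G = Z @ Z.T + 1e-8 * np.eye(k)
P = [Xv @ Z.T @ np.linalg.inv(G) for Xv in X]
W = Y @ Z.T @ np.linalg.inv(G)
obj_init = objective(X, Y, P, W, Z, lam)

for _ in range(50):
    G = Z @ Z.T + 1e-8 * np.eye(k)
    P = [Xv @ Z.T @ np.linalg.inv(G) for Xv in X]          # per-view bases
    W = Y @ Z.T @ np.linalg.inv(G)                         # classifier
    A = sum(Pv.T @ Pv for Pv in P) + lam * W.T @ W + 1e-8 * np.eye(k)
    B = sum(Pv.T @ Xv for Pv, Xv in zip(P, X)) + lam * W.T @ Y
    Z = np.linalg.solve(A, B)                              # shared subspace codes

obj_final = objective(X, Y, P, W, Z, lam)
```

At test time, such a model would classify a sample by projecting it into the learned subspace and reading off the largest margin under `W`, mirroring the decision-boundary step the abstract describes.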
