Action Recognition From Arbitrary Views Using Transferable Dictionary Learning

Human action recognition is crucial to many practical applications, ranging from human-computer interaction to video surveillance. Most approaches either recognize the human action from a fixed view or require the knowledge of view angle, which is usually not available in practical applications. In this paper, we propose a novel end-to-end framework to jointly learn a view-invariance transfer dictionary and a view-invariant classifier. The result of the process is a dictionary that can project real-world 2D video into a view-invariant sparse representation, and a classifier to recognize actions with an arbitrary view. The main feature of our algorithm is the use of synthetic data to extract view-invariance between 3D and 2D videos during the pre-training phase. This guarantees the availability of training data, and removes the hassle of obtaining real-world videos in specific viewing angles. Additionally, for better describing the actions in 3D videos, we introduce a new feature set called the 3D dense trajectories to effectively encode extracted trajectory information on 3D videos. Experimental results on the IXMAS, N-UCLA, i3DPost and UWA3DII data sets show improvements over existing algorithms.

[1]  Cordelia Schmid,et al.  Action recognition by dense trajectories , 2011, CVPR 2011.

[2]  M. Elad,et al.  $rm K$-SVD: An Algorithm for Designing Overcomplete Dictionaries for Sparse Representation , 2006, IEEE Transactions on Signal Processing.

[3]  Adrian Hilton,et al.  Shape-Colour Histograms for matching 3D video sequences , 2009, 2009 IEEE 12th International Conference on Computer Vision Workshops, ICCV Workshops.

[4]  Mubarak Shah,et al.  Learning human actions via information maximization , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[5]  Cordelia Schmid,et al.  Action Recognition with Improved Trajectories , 2013, 2013 IEEE International Conference on Computer Vision.

[6]  Jitendra Malik,et al.  Finding action tubes , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[7]  Yang Wang,et al.  Hidden Part Models for Human Action Recognition: Probabilistic versus Max Margin , 2011, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[8]  Andrew Zisserman,et al.  Two-Stream Convolutional Networks for Action Recognition in Videos , 2014, NIPS.

[9]  Chris Hecker,et al.  Real-time motion retargeting to highly varied user-created morphologies , 2008, SIGGRAPH 2008.

[10]  Yu-Chiang Frank Wang,et al.  Recognizing Actions across Cameras by Exploring the Correlated Subspace , 2012, ECCV Workshops.

[11]  Luc Van Gool,et al.  An Efficient Dense and Scale-Invariant Spatio-Temporal Interest Point Detector , 2008, ECCV.

[12]  Rémi Ronfard,et al.  Action Recognition from Arbitrary Views using 3D Exemplars , 2007, 2007 IEEE 11th International Conference on Computer Vision.

[13]  Guillermo Sapiro,et al.  Online dictionary learning for sparse coding , 2009, ICML '09.

[14]  Yang Yang,et al.  Learning semantic visual vocabularies using diffusion distance , 2009, CVPR.

[15]  Larry S. Davis,et al.  Learning a discriminative dictionary for sparse coding via label consistent K-SVD , 2011, CVPR 2011.

[16]  Lorenzo Torresani,et al.  Learning Spatiotemporal Features with 3D Convolutional Networks , 2014, 2015 IEEE International Conference on Computer Vision (ICCV).

[17]  Ronen Basri,et al.  Actions as Space-Time Shapes , 2007, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[18]  Ling Shao,et al.  Action Recognition Using 3D Histograms of Texture and A Multi-Class Boosting Classifier , 2017, IEEE Transactions on Image Processing.

[19]  Trevor Darrell,et al.  Long-term recurrent convolutional networks for visual recognition and description , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[20]  Binlong Li,et al.  Cross-view activity recognition using Hankelets , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[21]  Ying Wu,et al.  Cross-View Action Modeling, Learning, and Recognition , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[22]  Loong Fah Cheong,et al.  Activity recognition using dense long-duration trajectories , 2010, 2010 IEEE International Conference on Multimedia and Expo.

[23]  ShaoLing,et al.  Weakly-Supervised Cross-Domain Dictionary Learning for Visual Recognition , 2014 .

[24]  Marcel Körtgen,et al.  3 D Shape Matching with 3 D Shape Contexts , 2003 .

[25]  Kangkan Wang,et al.  Templateless Non-Rigid Reconstruction and Motion Tracking With a Single RGB-D Camera. , 2017, IEEE transactions on image processing : a publication of the IEEE Signal Processing Society.

[26]  Hubert P. H. Shum,et al.  Multi‐layer Lattice Model for Real‐Time Dynamic Character Deformation , 2015, Comput. Graph. Forum.

[27]  Ajmal S. Mian,et al.  Learning a non-linear knowledge transfer model for cross-view action recognition , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[28]  Cordelia Schmid,et al.  Dense Trajectories and Motion Boundary Descriptors for Action Recognition , 2013, International Journal of Computer Vision.

[29]  Larry S. Davis,et al.  Recognizing actions by shape-motion prototype trees , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[30]  Ruonan Li,et al.  Discriminative virtual views for cross-view action recognition , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[31]  Fei-Fei Li,et al.  Large-Scale Video Classification with Convolutional Neural Networks , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[32]  James J. Little,et al.  3D Pose from Motion for Cross-View Action Recognition via Non-linear Circulant Temporal Encoding , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[33]  Hong Liu,et al.  Enhanced skeleton visualization for view invariant human action recognition , 2017, Pattern Recognit..

[34]  Mubarak Shah,et al.  Action recognition in videos acquired by a moving camera using motion decomposition of Lagrangian particle trajectories , 2011, 2011 International Conference on Computer Vision.

[35]  Marcel Körtgen,et al.  3D Shape Matching with 3D Shape Contexts , 2003 .

[36]  Ling Shao,et al.  Arbitrary view action recognition via transfer dictionary learning on synthetic training data , 2016, 2016 IEEE International Conference on Robotics and Automation (ICRA).

[37]  Gene H. Golub,et al.  Tikhonov Regularization and Total Least Squares , 1999, SIAM J. Matrix Anal. Appl..

[38]  Ivan Laptev,et al.  On Space-Time Interest Points , 2005, International Journal of Computer Vision.

[39]  Ralph Gross,et al.  The CMU Motion of Body (MoBo) Database , 2001 .

[40]  Christopher Joseph Pal,et al.  Activity recognition using the velocity histories of tracked keypoints , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[41]  Edward K. Wong,et al.  Dual many-to-one-encoder-based transfer learning for cross-dataset human action recognition , 2016, Image Vis. Comput..

[42]  Luc Van Gool,et al.  Temporal Segment Networks: Towards Good Practices for Deep Action Recognition , 2016, ECCV.

[43]  Jintao Li,et al.  Hierarchical spatio-temporal context modeling for action recognition , 2009, CVPR.

[44]  Guillermo Sapiro,et al.  Non-Parametric Bayesian Dictionary Learning for Sparse Image Representations , 2009, NIPS.

[45]  Mubarak Shah,et al.  A 3-dimensional sift descriptor and its application to action recognition , 2007, ACM Multimedia.

[46]  Hong Liu,et al.  Energy-Based Global Ternary Image for Action Recognition Using Sole Depth Sequences , 2016, 2016 Fourth International Conference on 3D Vision (3DV).

[47]  Yun Fu,et al.  Discriminative Relational Representation Learning for RGB-D Action Recognition , 2016, IEEE Transactions on Image Processing.

[48]  Allen Y. Yang,et al.  Robust Face Recognition via Sparse Representation , 2009, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[49]  Thierry Dutoit,et al.  A DATABASE FOR STYLISTIC HUMAN GAIT MODELING AND SYNTHESIS , 2008 .

[50]  Rémi Ronfard,et al.  Free viewpoint action recognition using motion history volumes , 2006, Comput. Vis. Image Underst..

[51]  A. Bruckstein,et al.  K-SVD : An Algorithm for Designing of Overcomplete Dictionaries for Sparse Representation , 2005 .

[52]  Ali Farhadi,et al.  Learning to Recognize Activities from the Wrong View Point , 2008, ECCV.

[53]  Ioannis Pitas,et al.  The i3DPost Multi-View and 3D Human Action/Interaction Database , 2009, 2009 Conference for Visual Media Production.

[54]  Mubarak Shah,et al.  Learning 4D action feature models for arbitrary view action recognition , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[55]  Pinar Duygulu Sahin,et al.  A new pose-based representation for recognizing actions from multiple cameras , 2011, Comput. Vis. Image Underst..

[56]  Ramakant Nevatia,et al.  Single View Human Action Recognition using Key Pose Matching and Viterbi Path Searching , 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.

[57]  Martial Hebert,et al.  Trajectons: Action recognition through the motion analysis of tracked features , 2009, 2009 IEEE 12th International Conference on Computer Vision Workshops, ICCV Workshops.

[58]  Rama Chellappa,et al.  Cross-View Action Recognition via Transferable Dictionary Learning , 2016, IEEE Transactions on Image Processing.

[59]  Rama Chellappa,et al.  Domain Adaptive Dictionary Learning , 2012, ECCV.

[60]  Hans-Peter Kriegel,et al.  3D Shape Histograms for Similarity Search and Classification in Spatial Databases , 1999, SSD.

[61]  Chunheng Wang,et al.  Cross-View Action Recognition via a Continuous Virtual Path , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[62]  Serge J. Belongie,et al.  Behavior recognition via sparse spatio-temporal features , 2005, 2005 IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance.

[63]  Ioannis Pitas,et al.  3D Human Action Recognition for Multi-view Camera Systems , 2011, 2011 International Conference on 3D Imaging, Modeling, Processing, Visualization and Transmission.

[64]  Ling Shao,et al.  Weakly-Supervised Cross-Domain Dictionary Learning for Visual Recognition , 2014, International Journal of Computer Vision.

[65]  Isaac Cohen,et al.  Inference of human postures by classification of 3D human body shape , 2003, 2003 IEEE International SOI Conference. Proceedings (Cat. No.03CH37443).

[66]  Augusto Sarti,et al.  3-D Body Posture Tracking For Human Action Template Matching , 2006, 2006 IEEE International Conference on Acoustics Speech and Signal Processing Proceedings.

[67]  Adriana Kovashka,et al.  Learning a hierarchy of discriminative space-time neighborhood features for human action recognition , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[68]  Ajmal Mian,et al.  Learning a Deep Model for Human Action Recognition from Novel Viewpoints , 2016 .

[69]  Chengcheng Jia,et al.  Low-Rank Tensor Subspace Learning for RGB-D Action Recognition. , 2016 .

[70]  Silvio Savarese,et al.  Cross-view action recognition via view knowledge transfer , 2011, CVPR 2011.