Learning a sparse dictionary of video structure for activity modeling

We present an approach that incorporates spatio-temporal features, as well as the relationships between them, into a sparse dictionary learning framework for activity recognition. We propose that the dictionary learning framework can be adapted to learn complex relationships between features in an unsupervised manner. From a set of training videos, a dictionary is learned for individual features, as well as for the relationships between them, using a stacked predictive sparse decomposition framework. This combined dictionary provides a representation of the structure of the video and is pooled locally over space and time to obtain descriptors. The descriptors are then combined in a multiple kernel learning framework to design classifiers. Experiments on two popular activity recognition datasets demonstrate the superior performance of our approach on both single-person and multi-person activities.
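The core pipeline of learning a sparse dictionary over feature descriptors and then locally pooling the sparse codes can be sketched as follows. This is only an illustrative sketch using scikit-learn's `MiniBatchDictionaryLearning` on synthetic descriptors, not the paper's stacked predictive sparse decomposition; the descriptor dimension, dictionary size, sparsity level, and pooling group size are all assumptions.

```python
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning

# Stand-in for spatio-temporal feature descriptors extracted from training
# videos (e.g. one 100-dim descriptor per interest point). Random values
# here purely for illustration.
rng = np.random.RandomState(0)
features = rng.randn(500, 100)

# Learn a dictionary with a sparsity-inducing penalty; sparse codes are
# computed with orthogonal matching pursuit at transform time.
dico = MiniBatchDictionaryLearning(
    n_components=128,            # dictionary size (assumed)
    alpha=1.0,                   # sparsity weight (assumed)
    transform_algorithm="omp",
    transform_n_nonzero_coefs=5, # at most 5 active atoms per descriptor
    random_state=0,
)
codes = dico.fit(features).transform(features)  # shape (500, 128)

# Max-pool the sparse codes over local groups of descriptors (here, simple
# groups of 10 consecutive descriptors stand in for spatio-temporal
# neighborhoods) to obtain pooled descriptors for classification.
pooled = codes.reshape(50, 10, 128).max(axis=1)
print(pooled.shape)
```

The pooled descriptors would then be fed to a classifier; in the paper, descriptors from features and feature relationships are combined via multiple kernel learning rather than used individually.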
