Marginalised Stacked Denoising Autoencoders for Robust Representation of Real-Time Multi-View Action Recognition

Multi-view action recognition has attracted great interest in video surveillance, human-computer interaction, and multimedia retrieval, where multiple cameras of different types are deployed to provide complementary fields of view. Fusing multiple camera views leads to more robust decisions in both tracking multiple targets and analysing complex human activities, especially in the presence of occlusions. In this paper, we incorporate the marginalised stacked denoising autoencoders (mSDA) algorithm to improve the robustness and usefulness of the bag-of-words (BoW) representation for multi-view action recognition. The resulting representations are fed into three simple fusion strategies, as well as a multiple kernel learning algorithm, at the classification stage. Our internal evaluation suggests that neither the BoW codebook size nor the number of mSDA layers significantly affects recognition performance. On three multi-view benchmark datasets, the proposed framework improves recognition performance in every case and achieves record results, outperforming the state-of-the-art algorithms in the literature. It is also capable of real-time action recognition at 33 to 45 frames per second, which could be improved further by using more powerful machines in future applications.
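
To make the mSDA step concrete, the following is a minimal sketch of the closed-form marginalised denoising layer as it is generally described in the mSDA literature, applied to a matrix of BoW histograms. It is an illustration under stated assumptions, not the implementation used in this paper: the function names `mda_layer` and `msda`, the corruption probability `p`, the ridge term `reg`, and the choice to concatenate the input with every layer output are all assumptions made for the example.

```python
import numpy as np


def mda_layer(X, p, reg=1e-5):
    """One marginalised denoising layer (illustrative sketch).

    X   : (d, n) array, one BoW histogram per column (assumed layout).
    p   : feature corruption probability, 0 <= p < 1 (assumed hyperparameter).
    reg : small ridge term for numerical stability (assumed value).
    Returns the mapping W of shape (d, d + 1) and the hidden output h of shape (d, n).
    """
    d, n = X.shape
    Xb = np.vstack([X, np.ones((1, n))])      # append a constant bias feature
    q = np.full(d + 1, 1.0 - p)               # survival probability per feature
    q[-1] = 1.0                               # the bias feature is never corrupted
    S = Xb @ Xb.T                             # scatter matrix of the clean inputs
    Q = S * np.outer(q, q)                    # expected scatter of corrupted inputs
    np.fill_diagonal(Q, q * np.diag(S))       # diagonal terms survive with prob. q_i
    P = S * q[np.newaxis, :]                  # expected clean-vs-corrupted correlation
    # W = P[:d] @ inv(Q + reg * I); Q is symmetric, so solve the transposed system.
    W = np.linalg.solve(Q + reg * np.eye(d + 1), P[:d].T).T
    h = np.tanh(W @ Xb)                       # nonlinearity applied after the mapping
    return W, h


def msda(X, p, n_layers):
    """Stack n_layers marginalised layers; return input and all outputs concatenated."""
    reps = [X]
    h = X
    for _ in range(n_layers):
        _, h = mda_layer(h, p)
        reps.append(h)
    return np.vstack(reps)                    # shape ((n_layers + 1) * d, n)
```

Because each layer is solved in closed form, no iterative training is required, which is one reason such a representation can be compatible with real-time use. The stacked output could then be passed to whichever classifier or fusion strategy is chosen.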
