ARCH: Adaptive recurrent-convolutional hybrid networks for long-term action recognition

Recognition of human actions from digital video is a challenging task due to complex interfering factors in uncontrolled realistic environments. In this paper, we propose a learning framework using static, dynamic and sequential mixed features to solve three fundamental problems: spatial domain variation, temporal domain polytrope, and intra- and inter-class diversities. Utilizing a cognitive-based data reduction method and a hybrid "network upon networks" architecture, we extract human action representations which are robust against spatial and temporal interferences and adaptive to variations in both action speed and duration. We evaluated our method on the UCF101 and other three challenging datasets. Our results demonstrated a superior performance of the proposed algorithm in human action recognition.

[1]  Trevor Darrell,et al.  Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation , 2013, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[2]  Subhransu Maji,et al.  Describing people: A poselet-based approach to attribute classification , 2011, 2011 International Conference on Computer Vision.

[3]  Mikael Bodén,et al.  A guide to recurrent neural networks and backpropagation , 2001 .

[4]  Ilya Sutskever,et al.  Learning Recurrent Neural Networks with Hessian-Free Optimization , 2011, ICML.

[5]  Mubarak Shah,et al.  UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild , 2012, ArXiv.

[6]  Ashok Veeraraghavan,et al.  The Function Space of an Activity , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[7]  Patrick Bouthemy,et al.  Better Exploiting Motion for Better Action Recognition , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[8]  Rémi Ronfard,et al.  Free viewpoint action recognition using motion history volumes , 2006, Comput. Vis. Image Underst..

[9]  Barbara Hammer,et al.  On the approximation capability of recurrent neural networks , 2000, Neurocomputing.

[10]  Svetha Venkatesh,et al.  Learning and detecting activities from movement trajectories using the hierarchical hidden Markov model , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[11]  Jake K. Aggarwal,et al.  Human motion analysis: a review , 1997, Proceedings IEEE Nonrigid and Articulated Motion Workshop.

[12]  Ronald J. Williams,et al.  Gradient-based learning algorithms for recurrent networks and their computational complexity , 1995 .

[13]  E.J. Candes,et al.  An Introduction To Compressive Sampling , 2008, IEEE Signal Processing Magazine.

[14]  Barbara Caputo,et al.  Recognizing human actions: a local SVM approach , 2004, ICPR 2004.

[15]  Cordelia Schmid,et al.  Actions in context , 2009, CVPR.

[16]  James Martens,et al.  Deep learning via Hessian-free optimization , 2010, ICML.

[17]  Limin Wang,et al.  Multi-view Super Vector for Action Recognition , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[18]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[19]  Camille Couprie,et al.  Learning Hierarchical Features for Scene Labeling , 2013, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[20]  Yoshua Bengio,et al.  Gradient-based learning applied to document recognition , 1998, Proc. IEEE.

[21]  Trevor Darrell,et al.  Caffe: Convolutional Architecture for Fast Feature Embedding , 2014, ACM Multimedia.

[22]  Luc Van Gool,et al.  Action snippets: How many frames does human action recognition require? , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[23]  Samy Bengio,et al.  Modeling individual and group actions in meetings with layered HMMs , 2006, IEEE Transactions on Multimedia.

[24]  Dimitris Achlioptas,et al.  Database-friendly random projections , 2001, PODS.

[25]  Michael S. Bernstein,et al.  ImageNet Large Scale Visual Recognition Challenge , 2014, International Journal of Computer Vision.

[26]  Jonathan Tompson,et al.  Joint Training of a Convolutional Network and a Graphical Model for Human Pose Estimation , 2014, NIPS.

[27]  Alessandro Laio,et al.  Clustering by fast search and find of density peaks , 2014, Science.

[28]  Cordelia Schmid,et al.  Dense Trajectories and Motion Boundary Descriptors for Action Recognition , 2013, International Journal of Computer Vision.

[29]  Geoffrey E. Hinton,et al.  Training Recurrent Neural Networks , 2013 .

[30]  Thomas Serre,et al.  HMDB: A large video database for human motion recognition , 2011, 2011 International Conference on Computer Vision.

[31]  Liang Lin,et al.  Learning latent spatio-temporal compositional model for human action recognition , 2013, MM '13.

[32]  Yann LeCun,et al.  Convolutional Learning of Spatio-temporal Features , 2010, ECCV.

[33]  Xiang Zhang,et al.  OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks , 2013, ICLR.

[34]  Geoffrey E. Hinton,et al.  Generating Text with Recurrent Neural Networks , 2011, ICML.

[35]  Shaogang Gong,et al.  Recognition of group activities using dynamic probabilistic networks , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[36]  Dimitris Achlioptas,et al.  Database-friendly random projections: Johnson-Lindenstrauss with binary coins , 2003, J. Comput. Syst. Sci..

[37]  R. Blake,et al.  Brain Areas Involved in Perception of Biological Motion , 2000, Journal of Cognitive Neuroscience.

[38]  Steve Renals,et al.  THE USE OF RECURRENT NEURAL NETWORKS IN CONTINUOUS SPEECH RECOGNITION , 1996 .

[39]  Christian Wolf,et al.  Sequential Deep Learning for Human Action Recognition , 2011, HBU.

[40]  Harris V. Georgiou,et al.  Estimating the intrinsic dimension in fMRI space via dataset fractal analysis - Counting the 'cpu cores' of the human brain , 2014, ArXiv.

[41]  Cordelia Schmid,et al.  Action and Event Recognition with Fisher Vectors on a Compact Feature Set , 2013, 2013 IEEE International Conference on Computer Vision.

[42]  Ming Yang,et al.  3D Convolutional Neural Networks for Human Action Recognition , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[43]  Jitendra Malik,et al.  Poselets: Body part detectors trained using 3D human pose annotations , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[44]  Lorenzo Torresani,et al.  C3D: Generic Features for Video Analysis , 2014, ArXiv.

[45]  Dumitru Erhan,et al.  Going deeper with convolutions , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[46]  Fei-Fei Li,et al.  Large-Scale Video Classification with Convolutional Neural Networks , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[47]  Fei-Fei Li,et al.  What, where and who? Classifying events by scene and object recognition , 2007, 2007 IEEE 11th International Conference on Computer Vision.

[48]  Ronen Basri,et al.  Actions as space-time shapes , 2005, Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1.

[49]  Philip H. S. Torr,et al.  Feature sampling and partitioning for visual vocabulary generation on large action classification datasets , 2014, ArXiv.

[50]  Nitish Srivastava,et al.  Unsupervised Learning of Video Representations using LSTMs , 2015, ICML.

[51]  J.K. Aggarwal,et al.  Human activity analysis , 2011, ACM Comput. Surv..

[52]  Andrew Zisserman,et al.  Two-Stream Convolutional Networks for Action Recognition in Videos , 2014, NIPS.

[53]  R. Blake,et al.  Brain Areas Active during Visual Perception of Biological Motion , 2002, Neuron.

[54]  Rob Fergus,et al.  Visualizing and Understanding Convolutional Networks , 2013, ECCV.

[55]  J.K. Aggarwal,et al.  Recognition of High-level Group Activities Based on Activities of Individual Members , 2008, 2008 IEEE Workshop on Motion and video Computing.

[56]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[57]  Fernando De la Torre,et al.  Generalized time warping for multi-modal alignment of human motion , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[58]  Jian Sun,et al.  Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[59]  Rémi Ronfard,et al.  A survey of vision-based methods for action representation, segmentation and recognition , 2011, Comput. Vis. Image Underst..

[60]  Cordelia Schmid,et al.  Action recognition by dense trajectories , 2011, CVPR 2011.

[61]  Gregor M. Hörzer,et al.  Theta coupling between V4 and prefrontal cortex predicts visual short-term memory performance , 2012, Nature Neuroscience.

[62]  Cordelia Schmid,et al.  Action Recognition with Improved Trajectories , 2013, 2013 IEEE International Conference on Computer Vision.