Motion-Based Representations For Activity Recognition

In this dissertation, we propose four representations based on motion information for activity recognition. The first is a spatiotemporal local feature descriptor that extracts a robust set of statistical measures to describe motion patterns. This descriptor measures meaningful properties of co-occurrence matrices and captures local space-time characteristics of the motion through the magnitude and orientation of neighboring optical flow vectors. The second is a compact, novel mid-level representation based on co-occurrence matrices of codewords. This representation expresses the distribution of the features at a given offset over feature codewords from a pre-computed codebook and encodes global structures in various local region-based features. The third is a novel temporal stream for two-stream convolutional networks that uses images computed from the optical flow magnitude and orientation to learn richer motion information. The method applies simple non-linear transformations to the vertical and horizontal components of the optical flow to generate input images for the temporal stream. Finally, the fourth is a novel skeleton image representation to be used as input to convolutional neural networks (CNNs). The proposed approach encodes the temporal dynamics by explicitly computing the magnitude and orientation values of the skeleton joints. Moreover, the representation combines reference joints with a tree-structured skeleton, incorporating different spatial relationships between the joints and preserving important spatial relations.
Experimental evaluations on challenging, well-known activity recognition datasets (KTH, UCF Sports, HMDB51, UCF101, NTU RGB+D 60 and NTU RGB+D 120) show that the proposed representations achieve accuracy better than or comparable to the state of the art, indicating their suitability as video representations.
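
As a rough illustration of the kind of non-linear transformation mentioned for the temporal stream, the sketch below converts the horizontal and vertical optical flow components into magnitude and orientation images. It is a minimal sketch under stated assumptions, not the dissertation's implementation: the function name `flow_to_magnitude_orientation`, the `max_magnitude` parameter, and the scaling to 8-bit images are all hypothetical choices made for this example, and the flow field itself is assumed to come from any dense optical flow estimator.

```python
import numpy as np


def flow_to_magnitude_orientation(u, v, max_magnitude=None):
    """Map flow components (u, v) to 8-bit magnitude and orientation images.

    u, v: arrays of the same shape holding the horizontal and vertical
    optical flow components. The 8-bit scaling is an assumption made for
    this sketch, not a detail taken from the abstract.
    """
    u = np.asarray(u, dtype=np.float64)
    v = np.asarray(v, dtype=np.float64)

    # Magnitude: Euclidean length of each flow vector.
    magnitude = np.hypot(u, v)

    # Orientation: angle of each flow vector, wrapped to [0, 2*pi).
    orientation = np.mod(np.arctan2(v, u), 2.0 * np.pi)

    # Scale the magnitude so that max_magnitude maps to 255.
    if max_magnitude is None:
        max_magnitude = magnitude.max()
    scale = 255.0 / max_magnitude if max_magnitude > 0 else 0.0
    magnitude_image = np.clip(magnitude * scale, 0, 255).astype(np.uint8)

    # Scale the orientation so that a full turn spans the 8-bit range.
    orientation_image = (orientation / (2.0 * np.pi) * 255.0).astype(np.uint8)

    return magnitude_image, orientation_image


# Tiny example flow field (2 x 2 pixels).
u = np.array([[3.0, 0.0], [0.0, -1.0]])
v = np.array([[4.0, 0.0], [1.0, 0.0]])
magnitude_image, orientation_image = flow_to_magnitude_orientation(u, v)
```

The two resulting images could then be stacked over consecutive frames and fed to a temporal stream in place of the raw flow components. How frames are stacked and normalized in practice is left open here.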
