Late Fusion of Bayesian and Convolutional Models for Action Recognition

The activities we perform in daily life are generally carried out as a succession of atomic actions that follow a logical order, and the same holds for actions within a video sequence. In this paper, we propose a hybrid approach that fuses a deep convolutional neural network with a Bayesian model; the latter captures human-object interactions and the transitions between actions. The key idea is to combine the predictions of both approaches at decision time. We validate our strategy on two public datasets, CAD-120 and Watch-n-Patch, and show that our fusion approach yields accuracy gains of +4% and +6%, respectively, over a baseline approach. Temporal action recognition performance is clearly improved by the fusion, especially when classes are imbalanced.
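The abstract does not specify the exact fusion rule, but a common way to combine two classifiers at decision time is weighted late fusion of their per-class probability vectors. The sketch below is illustrative only: the function name `late_fusion` and the weight `alpha` are assumptions, not the authors' method.

```python
import numpy as np

def late_fusion(cnn_probs, bayes_probs, alpha=0.5):
    """Weighted late fusion of two per-class probability vectors.

    alpha controls the relative trust in the CNN branch versus
    the Bayesian branch (an assumed hyperparameter, typically
    tuned on a validation set).
    """
    cnn_probs = np.asarray(cnn_probs, dtype=float)
    bayes_probs = np.asarray(bayes_probs, dtype=float)
    fused = alpha * cnn_probs + (1.0 - alpha) * bayes_probs
    return fused / fused.sum()  # renormalize to a valid distribution

# Toy example: three action classes, each branch outputs its own
# probability estimate for the current video segment.
cnn = [0.7, 0.2, 0.1]      # CNN leans toward class 0
bayes = [0.3, 0.5, 0.2]    # Bayesian model leans toward class 1
fused = late_fusion(cnn, bayes, alpha=0.6)
predicted_class = int(np.argmax(fused))
```

Such decision-level fusion can help with imbalanced classes, since the Bayesian branch's transition model can correct confident but contextually implausible CNN predictions.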