Late Fusion of Bayesian and Convolutional Models for Action Recognition

The activities we perform in daily life are generally carried out as a succession of atomic actions that follow a logical order, and the same holds for actions within a video sequence. In this paper, we propose a hybrid approach that fuses a deep convolutional neural network with a Bayesian model; the latter captures human-object interactions and the transitions between actions. The key idea is to combine the predictions of both approaches at decision time. We validate our strategy on two public datasets, CAD-120 and Watch-n-Patch, and show that our fusion approach yields accuracy gains of +4% and +6%, respectively, over a baseline approach. Temporal action recognition performance is clearly improved by the fusion, especially when classes are imbalanced.
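The abstract does not specify the exact fusion rule, but a common way to combine two classifiers at decision time is weighted late fusion of their per-class probability vectors. The sketch below is illustrative only: the function name `late_fusion` and the weight `alpha` are assumptions, not the authors' method.

```python
import numpy as np

def late_fusion(cnn_probs, bayes_probs, alpha=0.5):
    """Weighted late fusion of two per-class probability vectors.

    alpha controls the relative trust in the CNN branch versus
    the Bayesian branch (an assumed hyperparameter, typically
    tuned on a validation set).
    """
    cnn_probs = np.asarray(cnn_probs, dtype=float)
    bayes_probs = np.asarray(bayes_probs, dtype=float)
    fused = alpha * cnn_probs + (1.0 - alpha) * bayes_probs
    return fused / fused.sum()  # renormalize to a valid distribution

# Toy example: three action classes, each branch outputs its own
# probability estimate for the current video segment.
cnn = [0.7, 0.2, 0.1]      # CNN leans toward class 0
bayes = [0.3, 0.5, 0.2]    # Bayesian model leans toward class 1
fused = late_fusion(cnn, bayes, alpha=0.6)
predicted_class = int(np.argmax(fused))
```

Such decision-level fusion can help with imbalanced classes, since the Bayesian branch's transition model can correct confident but contextually implausible CNN predictions.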