Paper Information - ConvGRU in Fine-grained Pitching Action Recognition for Action Outcome Prediction

Predicting the outcome of an action is a new challenge for robots that work collaboratively with humans. With the rapid progress in video action recognition in recent years, fine-grained action recognition from video data has become a new focus. Fine-grained action recognition detects subtle differences between actions at a finer granularity and is important in many fields, such as human-robot interaction, intelligent traffic management, sports training, and health care. Because different outcomes are closely tied to subtle differences in the actions that produce them, fine-grained action recognition is a practical approach to action outcome prediction. In this paper, we explore the performance of the convolutional gated recurrent unit (ConvGRU) on a fine-grained action recognition task: predicting the outcomes of ball pitching. Using sequences of RGB images of human actions, the proposed approach achieves 79.17% accuracy, which exceeds the current state-of-the-art result. We also compare different network implementations and show the influence of different image sampling methods, different fusion methods, and pre-training. Finally, we discuss the advantages and limitations of ConvGRU in action outcome prediction and fine-grained action recognition tasks.
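
For readers unfamiliar with ConvGRU, the block below sketches the commonly used formulation of a single ConvGRU update. It is given for orientation only: the abstract does not state which variant the paper implements, and the symbols are assumptions introduced here ($x_t$ is the input feature map at time step $t$, $h_t$ the hidden state, $*$ convolution, $\odot$ element-wise multiplication, $\sigma$ the logistic sigmoid, and $W$, $U$ learned convolution kernels). Bias terms are omitted.

```latex
\begin{aligned}
z_t &= \sigma\left(W_z * x_t + U_z * h_{t-1}\right) && \text{(update gate)} \\
r_t &= \sigma\left(W_r * x_t + U_r * h_{t-1}\right) && \text{(reset gate)} \\
\tilde{h}_t &= \tanh\left(W * x_t + U * \left(r_t \odot h_{t-1}\right)\right) && \text{(candidate state)} \\
h_t &= \left(1 - z_t\right) \odot h_{t-1} + z_t \odot \tilde{h}_t && \text{(new hidden state)}
\end{aligned}
```

Compared with a standard GRU, the matrix products are replaced by convolutions, so the hidden state keeps its spatial layout across the image sequence. Some implementations swap the roles of $z_t$ and $1 - z_t$ in the last line; the two conventions are equivalent up to relabeling the gate.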
