ConvGRU in Fine-grained Pitching Action Recognition for Action Outcome Prediction

Predicting the outcome of an action is a new challenge for robots that work collaboratively with humans. With the impressive progress in video action recognition in recent years, fine-grained action recognition from video data has become a new focus. Fine-grained action recognition detects subtle differences between actions at a finer granularity and is significant in many fields, such as human-robot interaction, intelligent traffic management, sports training, and healthcare. Because different outcomes are closely connected to subtle differences in actions, fine-grained action recognition is a practical method for action outcome prediction. In this paper, we explore the performance of the convolutional gated recurrent unit (ConvGRU) method on a fine-grained action recognition task: predicting the outcomes of ball pitching. Based on sequences of RGB images of human actions, the proposed approach achieves 79.17% accuracy, which exceeds the current state-of-the-art result. We also compare different network implementations and show the influence of different image sampling methods, different fusion methods, and pre-training. Finally, we discuss the advantages and limitations of ConvGRU in such action outcome prediction and fine-grained action recognition tasks.
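
For readers unfamiliar with ConvGRU, the block below sketches one commonly used formulation of a ConvGRU cell. It is not taken from this abstract, and the paper's exact variant (bias terms, kernel sizes, gate convention) may differ. All symbols are assumptions introduced for illustration: x_t is the input feature map at time step t, h_t is the hidden state, * denotes 2D convolution, \odot denotes element-wise multiplication, \sigma is the logistic sigmoid, and W, U are learned convolution kernels.

```latex
\begin{aligned}
z_t &= \sigma\left(W_z * x_t + U_z * h_{t-1}\right) && \text{(update gate)} \\
r_t &= \sigma\left(W_r * x_t + U_r * h_{t-1}\right) && \text{(reset gate)} \\
\tilde{h}_t &= \tanh\left(W_h * x_t + U_h * (r_t \odot h_{t-1})\right) && \text{(candidate state)} \\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t && \text{(new hidden state)}
\end{aligned}
```

Compared with a standard GRU, the matrix products are replaced by convolutions, so the hidden state keeps its spatial layout across frames. Some implementations swap the roles of z_t and 1 - z_t in the last line; the two conventions are equivalent up to relabeling the gate.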
