Human Action Forecasting by Learning Task Grammars

For effective human-robot interaction, it is important that a robotic assistant can forecast the next action a human will take in a given task. Unfortunately, real-world tasks are often long, complex, and repetitive, so forecasting is far from trivial. In this paper, we propose a novel deep recurrent architecture that takes as input features from a two-stream residual action recognition framework and learns to estimate the progress of human activities from video sequences; this surrogate progress-estimation task implicitly learns a temporal task grammar with respect to which activities can be localized and forecast. To learn the task grammar, we propose a stacked-LSTM-based multi-granularity progress estimation framework trained with a novel cumulative Euclidean loss. To demonstrate the effectiveness of the proposed architecture, we report experiments on two challenging robotic assistive tasks, namely (i) assembling an Ikea table from its constituent parts, and (ii) changing the tires of a car. Our results show that learning task grammars provides highly discriminative cues, improving forecasting accuracy by more than 9% over a baseline two-stream forecasting model while also outperforming other competitive schemes.
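
To make the surrogate progress-estimation idea concrete, the following is a minimal sketch (not the authors' implementation) of a stacked LSTM that maps per-frame two-stream features to a scalar task-progress estimate, together with one plausible reading of a cumulative Euclidean objective. The feature dimension (2048, roughly what a ResNet-style two-stream backbone would produce), the hidden size and depth, and the exact loss formulation are all assumptions made for illustration.

import torch
import torch.nn as nn

class ProgressLSTM(nn.Module):
    """Stacked LSTM that maps per-frame two-stream features to a
    task-progress estimate in [0, 1] at every time step.
    Feature dimension, hidden size, and depth are illustrative guesses."""
    def __init__(self, feat_dim=2048, hidden_dim=512, num_layers=2):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_dim, num_layers, batch_first=True)
        self.head = nn.Linear(hidden_dim, 1)

    def forward(self, feats):
        # feats: (batch, time, feat_dim) -> progress: (batch, time)
        h, _ = self.lstm(feats)
        return torch.sigmoid(self.head(h)).squeeze(-1)

def cumulative_euclidean_loss(pred, target):
    """One plausible reading of a 'cumulative' Euclidean objective
    (an assumption, not the paper's definition): penalise the squared
    error of the running-average progress estimate against the
    ground-truth progress at every time step."""
    steps = torch.arange(1, pred.size(1) + 1, device=pred.device)
    cum_pred = torch.cumsum(pred, dim=1) / steps
    return ((cum_pred - target) ** 2).sum(dim=1).mean()

# Toy usage: 4 clips of 30 frames with linear ground-truth progress.
model = ProgressLSTM()
feats = torch.randn(4, 30, 2048)
target = torch.linspace(0, 1, 30).expand(4, -1)
loss = cumulative_euclidean_loss(model(feats), target)
loss.backward()

In such a setup, the per-frame progress estimate serves as the temporal anchor (the learned task grammar) against which upcoming actions can be localized and forecast.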
