Deep Residual Feature Learning for Action Prediction

Though action recognition based on complete videos has achieved great success recently, action prediction remains a challenging task as the information provided by partial videos is not discriminative enough for classifying actions. In this paper, we propose a Deep Residual Feature Learning (DeepRFL) framework to explore more discriminative information from partial videos, achieving similar representations as those of complete videos. The proposed method is based on residual learning, which captures the salient differences between partial videos and their corresponding full videos. The partial videos can attain the missing information by learning from features of complete videos and thus improve the discriminative power. Moreover, our model can be trained efficiently in an end-to-end fashion. Extensive evaluations on the challenging UCF101 and HMDB51 datasets demonstrate that the proposed method outperforms state-of-the-art results.

[1]  Michael S. Ryoo,et al.  Human activity prediction: Early recognition of ongoing activities from streaming videos , 2011, 2011 International Conference on Computer Vision.

[2]  Shih-Fu Chang,et al.  ConvNet Architecture Search for Spatiotemporal Feature Learning , 2017, ArXiv.

[3]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[4]  Wei-Shi Zheng,et al.  Global-Local Temporal Saliency Action Prediction , 2017, IEEE Transactions on Image Processing.

[5]  Sergey Ioffe,et al.  Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift , 2015, ICML.

[6]  Stan Sclaroff,et al.  Learning Activity Progression in LSTMs for Activity Detection and Early Detection , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[7]  Jun Miao,et al.  Activity Auto-Completion: Predicting Human Activities from Partial Videos , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[8]  Thomas Serre,et al.  HMDB: A large video database for human motion recognition , 2011, 2011 International Conference on Computer Vision.

[9]  Andreas Christmann,et al.  Support vector machines , 2008, Data Mining and Knowledge Discovery Handbook.

[10]  Sven J. Dickinson,et al.  Recognize Human Activities from Partially Observed Videos , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[11]  Yunde Jia,et al.  Learning Human Interaction by Interactive Phrases , 2012, ECCV.

[12]  Ivan Laptev,et al.  On Space-Time Interest Points , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[13]  Lorenzo Torresani,et al.  C3D: Generic Features for Video Analysis , 2014, ArXiv.

[14]  Cordelia Schmid,et al.  Human Detection Using Oriented Histograms of Flow and Appearance , 2006, ECCV.

[15]  Luc Van Gool,et al.  Temporal Segment Networks: Towards Good Practices for Deep Action Recognition , 2016, ECCV.

[16]  Trevor Darrell,et al.  Long-term recurrent convolutional networks for visual recognition and description , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[17]  Antonio Torralba,et al.  Anticipating Visual Representations from Unlabeled Video , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[18]  Eric P. Xing,et al.  Dual Motion GAN for Future-Flow Embedded Video Prediction , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[19]  Mubarak Shah,et al.  UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild , 2012, ArXiv.

[20]  Dumitru Erhan,et al.  Going deeper with convolutions , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[21]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[22]  Lars Petersson,et al.  Encouraging LSTMs to Anticipate Actions Very Early , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[23]  Yun Fu,et al.  Deep Sequential Context Networks for Action Prediction , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[24]  Antonio Torralba,et al.  Generating Videos with Scene Dynamics , 2016, NIPS.

[25]  Bin Sun,et al.  Action Prediction From Videos via Memorizing Hard-to-Predict Samples , 2018, AAAI.

[26]  Fabio Viola,et al.  The Kinetics Human Action Video Dataset , 2017, ArXiv.

[27]  Andrew Zisserman,et al.  Two-Stream Convolutional Networks for Action Recognition in Videos , 2014, NIPS.