Action prediction via deep residual feature learning and weighted loss

Action prediction from partially observed videos is challenging because partial videos do not carry enough discriminative information for classification. In this paper, we propose a Deep Residual Feature Learning (DeepRFL) framework that extracts more discriminative information from partial videos, yielding representations similar to those of complete videos. The framework operates as a teacher-student network: the teacher network supplies complete-video features as supervision, and the student network uses residual feature learning to capture the salient differences between partial videos and their corresponding complete videos. The teacher and student networks are trained simultaneously, and a technique called partial feature detach is employed to prevent the teacher network from being disturbed by the student network. We also design a novel weighted loss function that gives a smaller penalty to partial videos with small observation ratios. Extensive evaluations on the challenging UCF101 and HMDB51 datasets demonstrate that the proposed method outperforms state-of-the-art methods without knowing the observation ratios of the test videos. The code will be made publicly available soon.
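The two training signals described above can be illustrated with a minimal, framework-free sketch. It is not the released DeepRFL code. The linear weighting in `ratio_weight`, the `floor` parameter, the normalization by the sum of weights, and the mean-squared feature loss are all assumptions made for illustration; the abstract only states that smaller observation ratios are penalized less and that the student learns a residual toward the complete-video feature. Partial feature detach is not modeled here, since it needs an automatic-differentiation framework.

```python
import math


def reconstruct(partial_feat, residual):
    """Residual feature learning: add a predicted residual to the
    partial-video feature to approximate the complete-video feature."""
    return [p + r for p, r in zip(partial_feat, residual)]


def feature_mse(student_feat, teacher_feat):
    """Mean squared error between the student's reconstructed feature and
    the teacher's complete-video feature (assumed form of the feature loss)."""
    diffs = [(s - t) ** 2 for s, t in zip(student_feat, teacher_feat)]
    return sum(diffs) / len(diffs)


def softmax_cross_entropy(logits, label):
    """Cross-entropy for one sample, computed stably from raw class scores."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]


def ratio_weight(ratio, floor=0.1):
    """Hypothetical weight that grows linearly with the observation ratio,
    so that barely observed videos are penalized less."""
    return floor + (1.0 - floor) * ratio


def weighted_loss(batch_logits, labels, ratios):
    """Weighted mean of per-sample cross-entropy; the observation ratios
    are needed only at training time."""
    weights = [ratio_weight(r) for r in ratios]
    losses = [softmax_cross_entropy(z, y) for z, y in zip(batch_logits, labels)]
    return sum(w * l for w, l in zip(weights, losses)) / sum(weights)
```

Because the weights enter only the training loss, inference runs on the student's reconstructed feature alone, in line with the claim that the observation ratio of a test video is not required.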
