论文信息 - Detecting the Moment of Completion: Temporal Models for Localising Action Completion

Detecting the Moment of Completion: Temporal Models for Localising Action Completion

Action completion detection is the problem of modelling the action's progression towards localising the moment of completion - when the action's goal is confidently considered achieved. In this work, we assess the ability of two temporal models, namely Hidden Markov Models (HMM) and Long-Short Term Memory (LSTM), to localise completion for six object interactions: switch, plug, open, pull, pick and drink. We use a supervised approach, where annotations of pre-completion and post-completion frames are available per action, and fine-tuned CNN features are used to train temporal models. Tested on the Action-Completion-2016 dataset, we detect completion within 10 frames of annotations for ~75% of completed action sequences using both temporal models. Results show that fine-tuned CNN features outperform hand-crafted features for localisation, and that observing incomplete instances is necessary when incomplete sequences are also present in the test set.

[1] Andrew Zisserman,et al. Two-Stream Convolutional Networks for Action Recognition in Videos , 2014, NIPS.

[2] Lynne E. Parker,et al. Simplex-Based 3D Spatio-temporal Feature Description for Action Recognition , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[3] Cordelia Schmid,et al. Dense Trajectories and Motion Boundary Descriptors for Action Recognition , 2013, International Journal of Computer Vision.

[4] Martial Hebert,et al. Trajectons: Action recognition through the motion analysis of tracked features , 2009, 2009 IEEE 12th International Conference on Computer Vision Workshops, ICCV Workshops.

[5] David A. Forsyth,et al. Automatic Annotation of Everyday Movements , 2003, NIPS.

[6] Rémi Ronfard,et al. Action Recognition from Arbitrary Views using 3D Exemplars , 2007, 2007 IEEE 11th International Conference on Computer Vision.

[7] Matti Pietikäinen,et al. Recognition of human actions using texture descriptors , 2011, Machine Vision and Applications.

[8] Trevor Darrell,et al. Long-term recurrent convolutional networks for visual recognition and description , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[9] Cordelia Schmid,et al. Action Recognition with Improved Trajectories , 2013, 2013 IEEE International Conference on Computer Vision.

[10] Andrew Zisserman,et al. Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[11] Alex Pentland,et al. Coupled hidden Markov models for complex action recognition , 1997, Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[12] Ying Wu,et al. Mining actionlet ensemble for action recognition with depth cameras , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[13] Li Fei-Fei,et al. Every Moment Counts: Dense Detailed Labeling of Actions in Complex Videos , 2015, International Journal of Computer Vision.

[14] Andrew Zisserman,et al. Convolutional Two-Stream Network Fusion for Video Action Recognition , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[15] Ming Yang,et al. 3D Convolutional Neural Networks for Human Action Recognition , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[16] Ali Farhadi,et al. Actions ~ Transformations , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[17] Andrew Gilbert,et al. Scale Invariant Action Recognition Using Compound Features Mined from Dense Spatio-temporal Corners , 2008, ECCV.

[18] Lorenzo Torresani,et al. Learning Spatiotemporal Features with 3D Convolutional Networks , 2014, 2015 IEEE International Conference on Computer Vision (ICCV).

[19] Mubarak Shah,et al. UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild , 2012, ArXiv.

[20] Juan Carlos Niebles,et al. Unsupervised Learning of Human Action Categories , 2006 .

[21] Guo-Jun Qi,et al. Differential Recurrent Neural Networks for Action Recognition , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[22] Alex Pentland,et al. A Bayesian Computer Vision System for Modeling Human Interactions , 1999, IEEE Trans. Pattern Anal. Mach. Intell..

[23] Cordelia Schmid,et al. Learning realistic human actions from movies , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[24] Jake K. Aggarwal,et al. Human action recognition with extremities as semantic posture representation , 2009, 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops.

[25] Christian Wolf,et al. Action Classification in Soccer Videos with Long Short-Term Memory Recurrent Neural Networks , 2010, ICANN.

[26] James A. Reggia,et al. Robust human action recognition via long short-term memory , 2013, The 2013 International Joint Conference on Neural Networks (IJCNN).

[27] Richard P. Wildes,et al. Review of Action Recognition and Detection Methods , 2016, ArXiv.

[28] Stan Sclaroff,et al. Learning Activity Progression in LSTMs for Activity Detection and Early Detection , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[29] Matthew J. Hausknecht,et al. Beyond short snippets: Deep networks for video classification , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[30] Jun Zhang,et al. Continuous action segmentation and recognition using hybrid convolutional neural network-hidden Markov model model , 2016, IET Comput. Vis..

[31] Majid Mirmehdi,et al. Beyond Action Recognition: Action Completion in RGB-D Data , 2016, BMVC.