Confidence-Guided Self Refinement for Action Prediction in Untrimmed Videos

Many existing methods formulate the action prediction task as recognizing early parts of actions in trimmed videos. In this paper, we focus on predicting actions from ongoing untrimmed videos where actions might not happen at the very beginning of videos. It is extremely challenging to predict actions in such untrimmed videos due to ambiguous or even no information of actions in the early parts of videos. To address this problem, we propose a prediction confidence that assesses the decision quality of a prediction model. Guided by the confidence, the model continuously refines the prediction results by itself with the increasing observed video frames. Specifically, we build a Self Prediction Refining Network (SPR-Net) which incrementally learns the confidence for action prediction. SPR-Net consists of three modules: a temporal hybrid network, an incremental confidence learner, and a self-refining Gumbel softmax sampler. The temporal hybrid network generates the action category distributions by integrating static scene and dynamic motion information. The incremental confidence learner calculates the confidence in an incremental manner, judging the extent to which the temporal hybrid network should believe its prediction result. The self-refining Gumbel softmax sampler models the mutual relationship between the prediction confidence and the category distribution, which enables them to be jointly learned in an end-to-end fashion. We also present a sparse self-attention mechanism to encode local spatio-temporal features into the frame-level motion representation to further improve the prediction performance. Extensive experiments on five datasets (i.e., UT-Interaction, BIT-Interaction, UCF101, THUMOS14, and ActivityNet) validate the effectiveness of the proposed method.

[1]  Anirban Chakraborty,et al.  Context-Aware Activity Forecasting , 2014, ACCV.

[2]  Hema Swetha Koppula,et al.  Recurrent Neural Networks for driver activity anticipation via sensory-fusion architecture , 2015, 2016 IEEE International Conference on Robotics and Automation (ICRA).

[3]  Haroon Idrees,et al.  Predicting the Where and What of Actors and Actions through Online Action Localization , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[4]  Bingbing Ni,et al.  Multiple Granularity Group Interaction Prediction , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[5]  Wei-Shi Zheng,et al.  Action Knowledge Transfer for Action Prediction with Partial Videos , 2019, AAAI.

[6]  Amit K. Roy-Chowdhury,et al.  Joint Prediction of Activity Labels and Starting Times in Untrimmed Videos , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[7]  Kuk-Jin Yoon,et al.  Robust Online Multi-object Tracking Based on Tracklet Confidence and Online Discriminative Appearance Learning , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[8]  Chuang Gan,et al.  End-to-End Learning of Motion Representation for Video Understanding , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[9]  Stanislas Dehaene,et al.  Decoding the Dynamics of Action, Intention, and Error Detection for Conscious and Subliminal Stimuli , 2014, The Journal of Neuroscience.

[10]  David A. Leopold,et al.  Blindsight depends on the lateral geniculate nucleus , 2010, Nature.

[11]  Yuan Yu,et al.  TensorFlow: A system for large-scale machine learning , 2016, OSDI.

[12]  S. Bonaccio,et al.  Advice taking and decision-making: An integrative literature review, and implications for the organizational sciences , 2006 .

[13]  Bernard Ghanem,et al.  ActivityNet: A large-scale video benchmark for human activity understanding , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[14]  Suman Saha,et al.  Online Real-Time Multiple Spatiotemporal Action Localisation and Prediction , 2016, 2017 IEEE International Conference on Computer Vision (ICCV).

[15]  Gaurav Sharma,et al.  Vehicle Tracking in Wide Area Motion Imagery via Stochastic Progressive Association Across Multiple Frames , 2017, IEEE Transactions on Image Processing.

[16]  Gang Yu,et al.  Predicting human activities using spatio-temporal structure of interest points , 2012, ACM Multimedia.

[17]  Yun Fu,et al.  Deep Sequential Context Networks for Action Prediction , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[18]  Michael S. Ryoo,et al.  Human activity prediction: Early recognition of ongoing activities from streaming videos , 2011, 2011 International Conference on Computer Vision.

[19]  Deva Ramanan,et al.  Attentional Pooling for Action Recognition , 2017, NIPS.

[20]  Martial Hebert,et al.  Activity Forecasting , 2012, ECCV.

[21]  Hatim A. Zariwala,et al.  Neural correlates, computation and behavioural impact of decision confidence , 2008, Nature.

[22]  Shih-Fu Chang,et al.  Online Detection of Action Start in Untrimmed, Streaming Videos , 2018, ECCV.

[23]  Silvio Savarese,et al.  A Hierarchical Representation for Future Action Prediction , 2014, ECCV.

[24]  Haroon Idrees,et al.  Online Localization and Prediction of Actions and Interactions , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[25]  Yu Qiao,et al.  Recurrent Spatial-Temporal Attention Network for Action Recognition in Videos , 2018, IEEE Transactions on Image Processing.

[26]  Luc Van Gool,et al.  Cascaded Confidence Filtering for Improved Tracking-by-Detection , 2010, ECCV.

[27]  Cees Snoek,et al.  VideoLSTM convolves, attends and flows for action recognition , 2016, Comput. Vis. Image Underst..

[28]  Mohammed Bennamoun,et al.  Leveraging Structural Context Models and Ranking Score Fusion for Human Interaction Prediction , 2018, IEEE Transactions on Multimedia.

[29]  Nick Yeung,et al.  Subjective Confidence Predicts Information Seeking in Decision Making , 2018, Psychological science.

[30]  Yunde Jia,et al.  Content-Attention Representation by Factorized Action-Scene Network for Action Recognition , 2018, IEEE Transactions on Multimedia.

[31]  Richard P. Wildes,et al.  Spatiotemporal Feature Residual Propagation for Action Prediction , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[32]  Martial Hebert,et al.  Semi-Supervised Self-Training of Object Detection Models , 2005, 2005 Seventh IEEE Workshops on Applications of Computer Vision (WACV/MOTION'05) - Volume 1.

[33]  Xiao Liu,et al.  Attention Clusters: Purely Attention Based Local Feature Integration for Video Classification , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[34]  Shih-Fu Chang,et al.  ConvNet Architecture Search for Spatiotemporal Feature Learning , 2017, ArXiv.

[35]  Larry S. Davis,et al.  On Encoding Temporal Evolution for Real-time Action Prediction , 2017 .

[36]  Yun Fu,et al.  Adversarial Action Prediction Networks , 2020, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[37]  Yunhao Liu,et al.  Quality of Trilateration: Confidence-Based Iterative Localization , 2008, IEEE Transactions on Parallel and Distributed Systems.

[38]  Yunde Jia,et al.  A Hierarchical Video Description for Complex Activity Understanding , 2016, International Journal of Computer Vision.

[39]  Luc Van Gool,et al.  Robust tracking-by-detection using a detector confidence particle filter , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[40]  Yi Yang,et al.  You Lead, We Exceed: Labor-Free Video Concept Learning by Jointly Exploiting Web Videos and Images , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[41]  P. Latham,et al.  References and Notes Supporting Online Material Materials and Methods Figs. S1 to S11 References Movie S1 Optimally Interacting Minds R�ports , 2022 .

[42]  Bowen Zhou,et al.  A Structured Self-attentive Sentence Embedding , 2017, ICLR.

[43]  Lars Petersson,et al.  Encouraging LSTMs to Anticipate Actions Very Early , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[44]  Ming-Hsuan Yang,et al.  Flow-Grounded Spatial-Temporal Video Prediction from Still Images , 2018, ECCV.

[45]  Gang Wang,et al.  SSNet: Scale Selection Network for Online 3D Action Prediction , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[46]  Timothy J. Pleskac,et al.  Two-stage dynamic signal detection: a theory of choice, decision time, and confidence. , 2010, Psychological review.

[47]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[48]  Bernt Schiele,et al.  Speaking the Same Language: Matching Machine to Human Captions by Adversarial Training , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[49]  Wei-Shi Zheng,et al.  Global-Local Temporal Saliency Action Prediction , 2017, IEEE Transactions on Image Processing.

[50]  Limin Wang,et al.  Temporal Action Detection with Structured Segment Networks , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[51]  Lorenzo Torresani,et al.  Learning Spatiotemporal Features with 3D Convolutional Networks , 2014, 2015 IEEE International Conference on Computer Vision (ICCV).

[52]  Mubarak Shah,et al.  UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild , 2012, ArXiv.

[53]  Roger Ratcliff,et al.  Aging and Confidence Judgments in Item Recognition , 2018, Journal of experimental psychology. Learning, memory, and cognition.

[54]  Luc Van Gool,et al.  UntrimmedNets for Weakly Supervised Action Recognition and Detection , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[55]  V. de Gardelle,et al.  The Impact of Evidence Reliability on Sensitivity and Bias in Decision Confidence , 2017, Journal of experimental psychology. Human perception and performance.

[56]  Yun Fu,et al.  Max-Margin Action Prediction Machine , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[57]  Danica Kragic,et al.  Deep Representation Learning for Human Motion Prediction and Classification , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[58]  Ramakant Nevatia,et al.  RED: Reinforced Encoder-Decoder Networks for Action Anticipation , 2017, BMVC.

[59]  Ming Yang,et al.  3D Convolutional Neural Networks for Human Action Recognition , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[60]  Gang Wang,et al.  Skeleton-Based Online Action Prediction Using Scale Selection Network , 2019, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[61]  Abhinav Gupta,et al.  Non-local Neural Networks , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[62]  Luc Van Gool,et al.  Temporal Segment Networks: Towards Good Practices for Deep Action Recognition , 2016, ECCV.

[63]  Changsheng Xu,et al.  Max-Confidence Boosting With Uncertainty for Visual Tracking , 2015, IEEE Transactions on Image Processing.

[64]  Yi Yang,et al.  DevNet: A Deep Event Network for multimedia event detection and evidence recounting , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[65]  Zhiting Hu,et al.  Improved Variational Autoencoders for Text Modeling using Dilated Convolutions , 2017, ICML.

[66]  Yoshua Bengio,et al.  Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation , 2014, EMNLP.

[67]  Yunde Jia,et al.  Interactive Phrases: Semantic Descriptionsfor Human Interaction Recognition , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[68]  Gang Wang,et al.  Early Action Prediction by Soft Regression , 2019, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[69]  Bin Sun,et al.  Action Prediction From Videos via Memorizing Hard-to-Predict Samples , 2018, AAAI.

[70]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[71]  Nathaniel D. Daw,et al.  Self-Evaluation of Decision-Making: A General Bayesian Framework for Metacognitive Computation , 2017, Psychological review.

[72]  Mohammed Bennamoun,et al.  Learning Latent Global Network for Skeleton-Based Action Prediction , 2020, IEEE Transactions on Image Processing.

[73]  Yun Fu,et al.  A Discriminative Model with Multiple Temporal Scales for Action Prediction , 2014, ECCV.

[74]  Geoffrey E. Hinton,et al.  Layer Normalization , 2016, ArXiv.

[75]  Cees Snoek,et al.  Online Action Detection , 2016, ECCV.

[76]  Hakwan Lau,et al.  There are things that we know that we know, and there are things that we do not know we do not know: Confidence in decision-making , 2015, Neuroscience & Biobehavioral Reviews.

[77]  Jianhuang Lai,et al.  Progressive Teacher-Student Learning for Early Action Prediction , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[78]  Andrew Zisserman,et al.  Two-Stream Convolutional Networks for Action Recognition in Videos , 2014, NIPS.

[79]  Ramón Fernández Astudillo,et al.  From Softmax to Sparsemax: A Sparse Model of Attention and Multi-Label Classification , 2016, ICML.

[80]  Yun Fu,et al.  Prediction of Human Activity by Discovering Temporal Sequence Patterns , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[81]  Cordelia Schmid,et al.  Dense Trajectories and Motion Boundary Descriptors for Action Recognition , 2013, International Journal of Computer Vision.

[82]  Yi Yang,et al.  A discriminative CNN video representation for event detection , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[83]  Luc Van Gool,et al.  Online Multiperson Tracking-by-Detection from a Single, Uncalibrated Camera , 2011, IEEE Transactions on Pattern Analysis and Machine Intelligence.