Review of Video Predictive Understanding: Early Action Recognition and Future Action Prediction

Video predictive understanding encompasses a wide range of efforts concerned with anticipating the unobserved future from current and historical video observations. Action prediction is a major sub-area of video predictive understanding and is the focus of this review. This sub-area has two major subdivisions: early action recognition and future action prediction. Early action recognition is concerned with recognizing an ongoing action as soon as possible. Future action prediction is concerned with anticipating the actions that follow those previously observed. In either case, the causal relationship between past, current and potential future information is the main focus. Various mathematical tools, such as Markov chains, Gaussian processes, auto-regressive modeling and Bayesian recursive filtering, are widely adopted jointly with computer vision techniques for these two tasks. However, these approaches face challenges such as the curse of dimensionality, poor generalization and constraints from domain-specific knowledge. Recently, architectures that rely on deep convolutional neural networks and recurrent neural networks have been widely proposed to improve performance on existing vision tasks in general and on action prediction tasks in particular. However, they have their own shortcomings, e.g., reliance on massive training data and a lack of strong theoretical underpinnings. In this survey, we start by introducing the major sub-areas of the broad area of video predictive understanding, which have recently received intensive attention and proven to have practical value. Next, a thorough review of various early action recognition and future action prediction algorithms is provided, organized into suitable divisions. Finally, we conclude our discussion with future research directions.
