Paper Information - Human Action Recognition and Prediction: A Survey

Driven by rapid advances in computer vision and machine learning, video analysis tasks have been moving from inferring the present state to predicting the future state. Vision-based action recognition and prediction from videos are two such tasks: action recognition infers human actions (the present state) from complete action executions, while action prediction predicts human actions (the future state) from incomplete action executions. Both tasks have recently become prominent research topics because of their rapidly emerging real-world applications, such as visual surveillance, autonomous driving, entertainment, and video retrieval. Over the last few decades, many efforts have been devoted to building a robust and effective framework for action recognition and prediction. In this paper, we survey the state-of-the-art techniques in action recognition and prediction. We also systematically discuss existing models, popular algorithms, technical difficulties, popular action databases, evaluation protocols, and promising future directions.

Yun Fu | Yu Kong
