Semantic human activity recognition: A literature review

Abstract: This paper presents an overview of state-of-the-art methods for activity recognition using semantic features. Unlike low-level features, semantic features describe inherent characteristics of activities. Semantics therefore make recognition more reliable, especially when the same action looks visually different because it can be executed in various ways. We define a semantic space that includes the most popular semantic features of an action, namely the human body (pose and poselet), attributes, related objects, and scene context. We present methods that exploit these semantic features to recognize activities in still images and video data, covering four groups of activities: atomic actions, people interactions, human–object interactions, and group activities. Furthermore, we outline potential applications of semantic approaches along with directions for future research.
