Semantic action recognition by learning a pose lexicon

Abstract This paper proposes a semantic representation, pose lexicon , for action recognition. The lexicon is composed of a set of semantic poses, a set of visual poses and a probabilistic mapping between the visual and semantic poses. Specially, an action can be represented by a sequence of semantic poses extracted from an associated textual instruction. Visual frames of the action are considered to be generated from a sequence of hidden visual poses. To learn the lexicon, a visual pose model is learned from training samples by a Gaussian Mixture model to characterize the likelihood of an observed visual frame being generated by a visual pose. A pose lexicon model is also learned by an extended hidden Markov alignment model to encode the probabilistic mapping between hidden visual poses and semantic poses sequences. With the lexicon, action classification is formulated as a problem of finding the maximum posterior probability of a given sequence of visual frames that fits to a given sequence of semantic poses through the most likely visual pose and alignment sequences. The efficacy of the proposed method was evaluated on MSRC-12, WorkoutSU-10, WorkoutUOW-18, Combined-15 and Combined-17 action datasets using cross-subject, cross-dataset and zero-shot protocols.

[1]  Chunfeng Yuan,et al.  A Hierarchical Model Based on Latent Dirichlet Allocation for Action Recognition , 2014, 2014 22nd International Conference on Pattern Recognition.

[2]  Jake K. Aggarwal,et al.  View invariant human action recognition using histograms of 3D joints , 2012, 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops.

[3]  Leonard Talmy,et al.  (1) Lexicalization patterns: Semantic structure in lexical forms; and , 1987 .

[4]  Wanqing Li,et al.  Learning a pose lexicon for semantic action recognition , 2016, 2016 IEEE International Conference on Multimedia and Expo (ICME).

[5]  Raymond J. Mooney,et al.  Improving Video Activity Recognition using Object Recognition and Text Mining , 2012, ECAI.

[6]  Wanqing Li,et al.  Action recognition based on a bag of 3D points , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition - Workshops.

[7]  Christian Bauckhage,et al.  Efficient Pose-Based Action Recognition , 2014, ACCV.

[8]  Robert L. Mercer,et al.  The Mathematics of Statistical Machine Translation: Parameter Estimation , 1993, CL.

[9]  Kate Saenko,et al.  Generating Natural-Language Video Descriptions Using Text-Mined Knowledge , 2013, AAAI.

[10]  Jing Zhang,et al.  RGB-D-based action recognition datasets: A survey , 2016, Pattern Recognit..

[11]  Jing Zhang,et al.  Action Recognition From Depth Maps Using Deep Convolutional Neural Networks , 2016, IEEE Transactions on Human-Machine Systems.

[12]  Helena M. Mentis,et al.  Instructing people for training gestural interactive systems , 2012, CHI.

[13]  Bernd Bohnet,et al.  Very high accuracy and fast dependency parsing is not a contradiction , 2010, COLING 2010.

[14]  Nasser Kehtarnavaz,et al.  UTD-MHAD: A multimodal dataset for human action recognition utilizing a depth camera and a wearable inertial sensor , 2015, 2015 IEEE International Conference on Image Processing (ICIP).

[15]  Jason J. Corso,et al.  Action bank: A high-level representation of activity in video , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[16]  Juan Carlos Niebles,et al.  Discriminative Hierarchical Modeling of Spatio-temporally Composable Human Activities , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[17]  Marcus Rohrbach,et al.  Translating Videos to Natural Language Using Deep Recurrent Neural Networks , 2014, NAACL.

[18]  Zicheng Liu,et al.  Expandable Data-Driven Graphical Modeling of Human Actions Based on Salient Postures , 2008, IEEE Transactions on Circuits and Systems for Video Technology.

[19]  Silvio Savarese,et al.  Watch-n-patch: Unsupervised understanding of actions and relations , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[20]  Philip S. Yu,et al.  Transfer Feature Learning with Joint Distribution Adaptation , 2013, 2013 IEEE International Conference on Computer Vision.

[21]  Pichao Wang,et al.  Mining Mid-Level Features for Action Recognition Based on Effective Skeleton Representation , 2014, 2014 International Conference on Digital Image Computing: Techniques and Applications (DICTA).

[22]  Jing Zhang,et al.  ConvNets-Based Action Recognition from Depth Maps through Virtual Cameras and Pseudocoloring , 2015, ACM Multimedia.

[23]  Y. C. Pati,et al.  Orthogonal matching pursuit: recursive function approximation with applications to wavelet decomposition , 1993, Proceedings of 27th Asilomar Conference on Signals, Systems and Computers.

[24]  Hermann Ney,et al.  HMM-Based Word Alignment in Statistical Translation , 1996, COLING.

[25]  Rémi Ronfard,et al.  A survey of vision-based methods for action representation, segmentation and recognition , 2011, Comput. Vis. Image Underst..

[26]  Pichao Wang,et al.  Online human action recognition based on incremental learning of weighted covariance descriptors , 2018, Inf. Sci..

[27]  Abdullah Al Mamun,et al.  Unsupervised Alignment of Actions in Video with Text Descriptions , 2016, IJCAI.

[28]  Lawrence R. Rabiner,et al.  A tutorial on hidden Markov models and selected applications in speech recognition , 1989, Proc. IEEE.

[29]  Hassan Foroosh,et al.  Action Recognition by Weakly-Supervised Discriminative Region Localization , 2014, BMVC.

[30]  Hermann Ney,et al.  A Systematic Comparison of Various Statistical Alignment Models , 2003, CL.

[31]  Massimo Piccardi,et al.  Training Initialization of Hidden Markov Models in Human Action Recognition , 2014, IEEE Transactions on Automation Science and Engineering.

[32]  Steve Marschner,et al.  Caliber: Camera Localization and Calibration Using Rigidity Constraints , 2016, International Journal of Computer Vision.

[33]  Marwan Torki,et al.  Human Action Recognition Using a Temporal Hierarchy of Covariance Descriptors on 3D Joint Locations , 2013, IJCAI.

[34]  Chunheng Wang,et al.  Attribute Regularization Based Human Action Recognition , 2013, IEEE Transactions on Information Forensics and Security.

[35]  Jiebo Luo,et al.  Unsupervised Alignment of Natural Language Instructions with Video Segments , 2014, AAAI.

[36]  Bernt Schiele,et al.  Translating Video Content to Natural Language Descriptions , 2013, 2013 IEEE International Conference on Computer Vision.

[37]  Wei Chen,et al.  Jointly Modeling Deep Video and Compositional Text to Bridge Vision and Language in a Unified Framework , 2015, AAAI.

[38]  Silvio Savarese,et al.  Recognizing human actions by attributes , 2011, CVPR 2011.

[39]  Jiebo Luo,et al.  Discriminative Unsupervised Alignment of Natural Language Instructions with Corresponding Video Segments , 2015, HLT-NAACL.

[40]  Aytül Erçil,et al.  A Decision Forest Based Feature Selection Framework for Action Recognition from RGB-Depth Cameras , 2013, ICIAR.

[41]  Thomas Hofmann,et al.  Unsupervised Learning by Probabilistic Latent Semantic Analysis , 2004, Machine Learning.

[42]  Yi Wang,et al.  Sequential Max-Margin Event Detectors , 2014, ECCV.

[43]  Koby Crammer,et al.  Online Large-Margin Training of Dependency Parsers , 2005, ACL.

[44]  Cordelia Schmid,et al.  Learning realistic human actions from movies , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[45]  Ulrich Germann,et al.  Greedy Decoding for Statistical Machine Translation in Almost Linear Time , 2003, NAACL.

[46]  Bernt Schiele,et al.  Script Data for Attribute-Based Recognition of Composite Activities , 2012, ECCV.

[47]  Danqi Chen,et al.  A Fast and Accurate Dependency Parser using Neural Networks , 2014, EMNLP.

[48]  Trevor Darrell,et al.  YouTube2Text: Recognizing and Describing Arbitrary Activities Using Semantic Hierarchies and Zero-Shot Recognition , 2013, 2013 IEEE International Conference on Computer Vision.

[49]  Yunde Jia,et al.  A Hierarchical Video Description for Complex Activity Understanding , 2016, International Journal of Computer Vision.

[50]  Jing Zhang,et al.  A Large Scale RGB-D Dataset for Action Recognition , 2016, UHA3DS@ICPR.

[51]  Greg Mori,et al.  IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL., NO. 1 Human Action Recognition by Semi-Latent Topic Models , 2022 .

[52]  Juan Carlos Niebles,et al.  Unsupervised Learning of Human Action Categories , 2006 .

[53]  Joakim Nivre,et al.  An Efficient Algorithm for Projective Dependency Parsing , 2003, IWPT.

[54]  Yue Zhang,et al.  Fast and Accurate Shift-Reduce Constituent Parsing , 2013, ACL.

[55]  Cristian Sminchisescu,et al.  The Moving Pose: An Efficient 3D Kinematics Descriptor for Low-Latency Action Recognition and Detection , 2013, 2013 IEEE International Conference on Computer Vision.

[56]  Alan L. Yuille,et al.  An Approach to Pose-Based Action Recognition , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[57]  Providen e RIe Immediate-Head Parsing for Language Models , 2001 .

[58]  Cordelia Schmid,et al.  Weakly-Supervised Alignment of Video with Text , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[59]  Dan Klein,et al.  Learning Accurate, Compact, and Interpretable Tree Annotation , 2006, ACL.

[60]  Yang Wang,et al.  Human Action Recognition by Semilatent Topic Models , 2009, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[61]  Wanqing Li,et al.  Discriminative Key Pose Extraction Using Extended LC-KSVD for Action Recognition , 2014, 2014 International Conference on Digital Image Computing: Techniques and Applications (DICTA).

[62]  Alon Lavie,et al.  Unsupervised Word Alignment with Arbitrary Features , 2011, ACL.

[63]  Michael I. Jordan,et al.  Latent Dirichlet Allocation , 2001, J. Mach. Learn. Res..