A Probabilistic Semi-Supervised Approach to Multi-Task Human Activity Modeling

Video-based prediction of human activity is usually performed on one of two levels: either a model is trained to anticipate high-level action labels or it is trained to predict future trajectories either in skeletal joint space or in image pixel space. This separation of classification and regression tasks implies that models cannot make use of the mutual information between continuous and semantic observations. However, if a model knew that an observed human wants to drink from a nearby glass, the space of possible trajectories would be highly constrained to reaching movements. Likewise, if a model had predicted a reaching trajectory, the inference of future semantic labels would rank "lifting" more likely than "walking". In this work, we propose a semi-supervised generative latent variable model that addresses both of these levels by modeling continuous observations as well as semantic labels. This fusion of signals allows the model to solve several tasks, such as action detection and anticipation as well as motion prediction and synthesis, simultaneously. We demonstrate this ability on the UTKinect-Action3D dataset, which consists of noisy, partially labeled multi-action sequences. The aim of this work is to encourage research within the field of human activity modeling based on mixed categorical and continuous data.

[1]  David J. Fleet,et al.  This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE Gaussian Process Dynamical Model , 2007 .

[2]  Gang Wang,et al.  Skeleton-Based Action Recognition Using Spatio-Temporal LSTM Network with Trust Gates , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[3]  Hema Swetha Koppula,et al.  Anticipating Human Activities Using Object Affordances for Reactive Robotic Response , 2013, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[4]  Cristian Sminchisescu,et al.  Human3.6M: Large Scale Datasets and Predictive Methods for 3D Human Sensing in Natural Environments , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[5]  Stefano Berretti,et al.  Representation, Analysis, and Recognition of 3D Humans , 2018, ACM Trans. Multim. Comput. Commun. Appl..

[6]  Danica Kragic,et al.  Anticipating many futures: Online human motion prediction and synthesis for human-robot collaboration , 2017, ArXiv.

[7]  Yee Whye Teh,et al.  The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables , 2016, ICLR.

[8]  Daan Wierstra,et al.  Stochastic Backpropagation and Approximate Inference in Deep Generative Models , 2014, ICML.

[9]  Jake K. Aggarwal,et al.  View invariant human action recognition using histograms of 3D joints , 2012, 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops.

[10]  Jürgen Schmidhuber,et al.  Long Short-Term Memory , 1997, Neural Computation.

[11]  Stephan Mandt,et al.  Disentangled Sequential Autoencoder , 2018, ICML.

[12]  Yong Du,et al.  Hierarchical recurrent neural network for skeleton based action recognition , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[13]  Hedvig Kjellström,et al.  Advances in Variational Inference , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[14]  Ying Tan,et al.  Variational Autoencoder for Semi-Supervised Text Classification , 2017, AAAI.

[15]  Taku Komura,et al.  A Recurrent Variational Autoencoder for Human Motion Synthesis , 2017, BMVC.

[16]  Gang Wang,et al.  Skeleton-Based Human Action Recognition With Global Context-Aware Attention LSTM Networks , 2017, IEEE Transactions on Image Processing.

[17]  Arnav Bhavsar,et al.  Scale Invariant Human Action Detection from Depth Cameras Using Class Templates , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[18]  Dimitris Samaras,et al.  Two-person interaction detection using body-pose features and multiple instance learning , 2012, 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops.

[19]  Austin Reiter,et al.  Interpretable 3D Human Action Analysis with Temporal Convolutional Networks , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[20]  Yong Du,et al.  Skeleton based action recognition with convolutional neural network , 2015, 2015 3rd IAPR Asian Conference on Pattern Recognition (ACPR).

[21]  B. Schölkopf,et al.  Modeling Human Motion Using Binary Latent Variables , 2007 .

[22]  Ryan Cotterell,et al.  A Structured Variational Autoencoder for Contextual Morphological Inflection , 2018, ACL.

[23]  Nanning Zheng,et al.  View Adaptive Recurrent Neural Networks for High Performance Human Action Recognition from Skeleton Data , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[24]  Hema Swetha Koppula,et al.  Learning human activities and object affordances from RGB-D videos , 2012, Int. J. Robotics Res..

[25]  Michael J. Black,et al.  On Human Motion Prediction Using Recurrent Neural Networks , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[26]  Max Welling,et al.  Auto-Encoding Variational Bayes , 2013, ICLR.

[27]  Max Welling,et al.  Semi-supervised Learning with Deep Generative Models , 2014, NIPS.

[28]  Sergio Escalera,et al.  RGB-D-based Human Motion Recognition with Deep Learning: A Survey , 2017, Comput. Vis. Image Underst..

[29]  Ben Poole,et al.  Categorical Reparameterization with Gumbel-Softmax , 2016, ICLR.

[30]  Junji Yamato,et al.  Recognizing human action in time-sequential images using hidden Markov model , 1992, Proceedings 1992 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[31]  Juan Carlos Niebles,et al.  A Hierarchical Pose-Based Approach to Complex Action Understanding Using Dictionaries of Actionlets and Motion Poselets , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[32]  Danica Kragic,et al.  Deep Representation Learning for Human Motion Prediction and Classification , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[33]  Gang Wang,et al.  SSNet: Scale Selection Network for Online 3D Action Prediction , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[34]  Yoshua Bengio,et al.  A Recurrent Latent Variable Model for Sequential Data , 2015, NIPS.

[35]  Wenjun Zeng,et al.  An End-to-End Spatio-Temporal Attention Model for Human Action Recognition from Skeleton Data , 2016, AAAI.

[36]  Lars Petersson,et al.  Encouraging LSTMs to Anticipate Actions Very Early , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[37]  Wenjun Zeng,et al.  Online Human Action Detection using Joint Classification-Regression Recurrent Neural Networks , 2016, ECCV.

[38]  Graham Neubig,et al.  Multi-space Variational Encoder-Decoders for Semi-supervised Labeled Sequence Transduction , 2017, ACL.

[39]  Hong Cheng,et al.  Interactive body part contrast mining for human interaction recognition , 2014, 2014 IEEE International Conference on Multimedia and Expo Workshops (ICMEW).

[40]  Silvio Savarese,et al.  Structural-RNN: Deep Learning on Spatio-Temporal Graphs , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).