Latent Pose Estimator for Continuous Action Recognition

Recently, models based on conditional random fields (CRFs) have produced promising results in labeling sequential data in several scientific fields. However, in the vision task of continuous action recognition, the visual feature observations can have hundreds or even thousands of dimensions. This can make parameter estimation severely difficult and may even degrade performance. To bridge the gap between the high-dimensional observations and the random fields, we propose a novel model that replaces the observation layer of a traditional random field model with a latent pose estimator. In the training stage, the human pose is not observed in the action data, and the latent pose estimator is learned under the supervision of the labeled action data rather than image-to-pose data. The advantage of this model is twofold. First, it learns to convert the high-dimensional observations into more compact and informative representations. Second, it enables transfer learning to fully utilize existing knowledge and data on the image-to-pose relationship. The parameters of the latent pose estimator and the random fields are jointly optimized through a gradient ascent algorithm. Our approach is tested on HumanEva [11], a publicly available dataset. The experiments show that our approach improves recognition accuracy over the standard CRF model and its variants. The performance can be further improved significantly by using additional image-to-pose data for training. Our experiments also show that the model trained on HumanEva can generalize to different environments and human subjects.
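To make the joint optimization concrete, the following is a minimal sketch, not the authors' implementation. It assumes a linear pose estimator `W` (the abstract does not state its form), a linear-chain CRF whose per-frame scores are computed from the latent pose through a matrix `U`, and a transition matrix `A`. All names, sizes, the learning rate, and the synthetic data are hypothetical choices for illustration.

```python
import numpy as np

def logsumexp(a, axis):
    """Numerically stable log(sum(exp(a))) along one axis."""
    m = np.max(a, axis=axis, keepdims=True)
    s = np.sum(np.exp(a - m), axis=axis, keepdims=True)
    return np.squeeze(m + np.log(s), axis=axis)

def loglik_and_grads(W, U, A, X, y):
    """Log-likelihood of labels y given features X, and gradients for W, U, A."""
    T, K = len(y), A.shape[0]
    Z = X @ W.T  # (T, d) latent pose; assumed linear, never observed in training
    S = Z @ U    # (T, K) per-frame label scores computed from the pose
    # Forward and backward recursions in log space.
    alpha = np.zeros((T, K))
    beta = np.zeros((T, K))
    alpha[0] = S[0]
    for t in range(1, T):
        alpha[t] = S[t] + logsumexp(alpha[t - 1][:, None] + A, axis=0)
    for t in range(T - 2, -1, -1):
        beta[t] = logsumexp(A + (S[t + 1] + beta[t + 1])[None, :], axis=1)
    log_Z = logsumexp(alpha[-1], axis=0)
    # Unnormalized score of the observed label sequence.
    score = S[np.arange(T), y].sum() + A[y[:-1], y[1:]].sum()
    # Gradient w.r.t. frame scores: observed minus expected label indicators.
    G = -np.exp(alpha + beta - log_Z)
    G[np.arange(T), y] += 1.0
    # Gradient w.r.t. transitions: observed minus expected transition counts.
    dA = np.zeros_like(A)
    np.add.at(dA, (y[:-1], y[1:]), 1.0)
    for t in range(1, T):
        pair = alpha[t - 1][:, None] + A + (S[t] + beta[t])[None, :] - log_Z
        dA -= np.exp(pair)
    dU = Z.T @ G          # (d, K)
    dW = (G @ U.T).T @ X  # (d, D): chain rule through the pose estimator
    return score - log_Z, dW, dU, dA

# Hypothetical sizes: D-dim features, d-dim latent pose, K actions, T frames.
rng = np.random.default_rng(0)
D, d, K, T = 50, 3, 4, 20
X = rng.normal(size=(T, D)) / np.sqrt(D)  # synthetic stand-in for visual features
y = rng.integers(0, K, size=T)            # synthetic per-frame action labels
W = 0.1 * rng.normal(size=(d, D))
U = 0.1 * rng.normal(size=(d, K))
A = np.zeros((K, K))

lr = 0.02  # assumed step size
history = []
for step in range(300):
    ll, dW, dU, dA = loglik_and_grads(W, U, A, X, y)
    history.append(ll)
    # Joint gradient ascent on the pose estimator and the random field.
    W += lr * dW
    U += lr * dU
    A += lr * dA
```

In this sketch only the action labels `y` supervise `W`, which mirrors the statement that pose is latent during training. The bottleneck `d` is what turns the high-dimensional observation into a compact representation before it reaches the random field.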

[1] J. M. Hammersley, et al. Markov fields on finite graphs and lattices, 1971.

[2] Lawrence R. Rabiner, et al. A tutorial on hidden Markov models and selected applications in speech recognition, 1989, Proc. IEEE.

[3] Robert A. Jacobs, et al. Hierarchical Mixtures of Experts and the EM Algorithm, 1993, Neural Computation.

[4] Rich Caruana, et al. Multitask Learning, 1997, Machine-mediated learning.

[5] Andrew McCallum, et al. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data, 2001, ICML.

[6] David A. Forsyth, et al. Automatic Annotation of Everyday Movements, 2003, NIPS.

[7] Trevor Darrell, et al. Conditional Random Fields for Object Recognition, 2004, NIPS.

[8] Pietro Perona, et al. A Bayesian hierarchical model for learning natural scene categories, 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[9] Cristian Sminchisescu, et al. Conditional models for contextual human motion recognition, 2006, Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1.

[10] Mubarak Shah, et al. Recognizing human actions in videos acquired by uncalibrated moving cameras, 2005, Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1.

[11] Michael J. Black, et al. HumanEva: Synchronized Video and Motion Capture Dataset for Evaluation of Articulated Human Motion, 2006.

[12] Ankur Agarwal, et al. Recovering 3D human pose from monocular images, 2006, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[13] Ramakant Nevatia, et al. Single View Human Action Recognition using Key Pose Matching and Viterbi Path Searching, 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.

[14] Ronen Basri, et al. Actions as Space-Time Shapes, 2007, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[15] Rama Chellappa, et al. Epitomic Representation of Human Activities, 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.

[16] Trevor Darrell, et al. Latent-Dynamic Discriminative Models for Continuous Gesture Recognition, 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.

[17] Cristian Sminchisescu, et al. BM³E: Discriminative Density Propagation for Visual Tracking, 2007, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[18] Yihong Gong, et al. Discriminative learning of visual words for 3D human pose estimation, 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.