Multi-feature max-margin hierarchical Bayesian model for action recognition

In this paper, a multi-feature max-margin hierarchical Bayesian model (M3HBM) is proposed for action recognition. Different from existing methods which separate representation and classification into two steps, M3HBM jointly learns a high-level representation by combining a hierarchical generative model (HGM) and discriminative max-margin classifiers in a unified Bayesian framework. Specifically, HGM is proposed to represent actions by distributions over latent spatial temporal patterns (STPs) which are learned from multiple feature modalities and shared among different classes. For recognition, we employ Gibbs classifiers to minimize the expected loss function based on the max-margin principle and use the classifiers as regularization terms of M3HBM to perform Bayeisan estimation for classifier parameters together with the learning of STPs. In addition, multi-task learning is applied to learn the model from multiple feature modalities for different classes. For test videos, we obtain the representations by the inference process and perform action recognition by the learned Gibbs classifiers. For the learning and inference process, we derive an efficient Gibbs sampling algorithm to solve the proposed M3HBM. Extensive experiments on several datasets demonstrate both the representation power and the classification capability of our approach for action recognition.

[1]  Zhi-Hua Zhou,et al.  Multi-Modal Image Annotation with Multi-Instance Multi-Label LDA , 2013, IJCAI.

[2]  Xiaogang Wang,et al.  Random field topic model for semantic region analysis in crowded scenes from tracklets , 2011, CVPR 2011.

[3]  Christian Bauckhage,et al.  Action recognition by learning discriminative key poses , 2011, 2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops).

[4]  Thomas Hofmann,et al.  Probabilistic latent semantic indexing , 1999, SIGIR '99.

[5]  Juan Carlos Niebles,et al.  Unsupervised Learning of Human Action Categories Using Spatial-Temporal Words , 2006, BMVC.

[6]  Gang Hua,et al.  Context aware topic model for scene recognition , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[7]  Ivan Laptev,et al.  On Space-Time Interest Points , 2005, International Journal of Computer Vision.

[8]  Bruce A. Draper,et al.  Scalable action recognition with a subspace forest , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[9]  Shaogang Gong,et al.  Video Behaviour Mining Using a Dynamic Topic Model , 2011, International Journal of Computer Vision.

[10]  L. Devroye Non-Uniform Random Variate Generation , 1986 .

[11]  Barbara Caputo,et al.  Recognizing human actions: a local SVM approach , 2004, Proceedings of the 17th International Conference on Pattern Recognition, 2004. ICPR 2004..

[12]  Juan Carlos Niebles,et al.  A Hierarchical Model of Shape and Appearance for Human Action Classification , 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.

[13]  Mubarak Shah,et al.  Action MACH a spatio-temporal Maximum Average Correlation Height filter for action recognition , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[14]  Nicholas G. Polson,et al.  Data augmentation for support vector machines , 2011 .

[15]  Jianxiong Xiao,et al.  Characterizing Layouts of Outdoor Scenes Using Spatial Topic Processes , 2013, 2013 IEEE International Conference on Computer Vision.

[16]  Venu Govindaraju,et al.  A temporal Bayesian model for classifying, detecting and localizing activities in video sequences , 2012, 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops.

[17]  Tieniu Tan,et al.  Relevance Topic Model for Unstructured Social Group Activity Recognition , 2013, NIPS.

[18]  Ning Chen,et al.  Gibbs max-margin topic models with data augmentation , 2013, J. Mach. Learn. Res..

[19]  Yang Wang,et al.  Max-margin Latent Dirichlet Allocation for Image Classification and Annotation , 2011, BMVC.

[20]  Yale Song,et al.  Action Recognition by Hierarchical Sequence Summarization , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[21]  Mubarak Shah,et al.  A 3-dimensional sift descriptor and its application to action recognition , 2007, ACM Multimedia.

[22]  Amit K. Roy-Chowdhury,et al.  Context-Aware Modeling and Recognition of Activities in Video , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[23]  Cordelia Schmid,et al.  Action recognition by dense trajectories , 2011, CVPR 2011.

[24]  Li Fei-Fei,et al.  Towards total scene understanding: Classification, annotation and segmentation in an automatic framework , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[25]  Adriana Kovashka,et al.  Learning a hierarchy of discriminative space-time neighborhood features for human action recognition , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[26]  Yang Wang,et al.  Human Action Recognition by Semilatent Topic Models , 2009, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[27]  Quoc V. Le,et al.  Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis , 2011, CVPR 2011.

[28]  Cordelia Schmid,et al.  Evaluation of Local Spatio-temporal Features for Action Recognition , 2009, BMVC.

[29]  Michael I. Jordan,et al.  Latent Dirichlet Allocation , 2001, J. Mach. Learn. Res..

[30]  Geoffrey E. Hinton,et al.  Replicated Softmax: an Undirected Topic Model , 2009, NIPS.

[31]  Michael I. Jordan,et al.  DiscLDA: Discriminative Learning for Dimensionality Reduction and Classification , 2008, NIPS.