CSMMI: Class-Specific Maximization of Mutual Information for Action and Gesture Recognition

In this paper, we propose a novel approach called class-specific maximization of mutual information (CSMMI) using a submodular method, which aims at learning a compact and discriminative dictionary for each class. Unlike traditional dictionary-based algorithms, which typically learn a shared dictionary for all classes, we unify the intraclass and interclass mutual information (MI) into a single objective function to optimize a class-specific dictionary. The objective function has two aims: 1) maximizing the MI between dictionary items within a specific class (intrinsic structure) and 2) minimizing the MI between the dictionary items in a given class and those of the other classes (extrinsic structure). We significantly reduce the computational complexity of CSMMI by introducing a novel submodular method, which is one of the important contributions of this paper. This paper also contributes a state-of-the-art end-to-end system for action and gesture recognition incorporating CSMMI, with feature extraction, learning an initial dictionary for each class via sparse coding, CSMMI via submodularity, and classification based on reconstruction errors. We performed extensive experiments on synthetic data and eight benchmark data sets. Our experimental results show that CSMMI outperforms shared dictionary methods and that our end-to-end system is competitive with other state-of-the-art approaches.
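The two-part objective above (keep items informative about their own class, penalize items informative about other classes) can be illustrated with a toy greedy sketch. This is not the paper's actual submodular algorithm; `mi_intra`, `mi_inter`, and the trade-off weight `lam` are hypothetical precomputed quantities assumed for illustration only.

```python
import numpy as np

def greedy_csmmi(mi_intra, mi_inter, k, lam=1.0):
    """Greedily select k dictionary items for one class.

    Toy sketch of the class-specific objective described above:
    each candidate i is scored by its MI with the class's own
    items (mi_intra[i], intrinsic structure) minus its MI with
    other classes' items (mi_inter[i], extrinsic structure).
    Both inputs are (n_items,) arrays of precomputed MI sums;
    the real method optimizes a submodular set function, which
    this simple marginal-gain loop only approximates.
    """
    n = len(mi_intra)
    selected, remaining = [], set(range(n))
    for _ in range(k):
        # pick the item with the best intrinsic-minus-extrinsic score
        best = max(remaining, key=lambda i: mi_intra[i] - lam * mi_inter[i])
        selected.append(best)
        remaining.remove(best)
    return selected
```

In the full system, one such selection would be run per class over that class's initial sparse-coding dictionary, and a test sample would then be assigned to the class whose selected dictionary yields the lowest reconstruction error.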
