Supervised Learning and Codebook Optimization for Bag-of-Words Models

In this paper, we present a novel approach for supervised codebook learning and optimization for bag-of-words models. This type of models is frequently used in visual recognition tasks like object class recognition or human action recognition. An entity is represented as a histogram of codewords, which are traditionally clustered with unsupervised methods like k-means or random forests and then classified in a supervised way. We propose a new supervised method for joint codebook creation and class learning, which learns the cluster centers of the codebook in a goal-directed way using the class labels of the training set. As a result, the codebook is highly correlated to the recognition problem, leading to a more discriminative codebook. We propose two different learning algorithms, one based on error backpropagation and the other based on cluster label reassignment. We apply the proposed method to human action recognition from video sequences and evaluate it on the KTH data set, reporting very promising results. The proposed technique allows us to improve the discriminative power of an unsupervised learned codebook or to keep the discriminative power while decreasing the size of the learned codebook, thus decreasing the computational complexity due to the nearest neighbor search.

[1]  Jake K. Aggarwal,et al.  Stochastic Representation and Recognition of High-Level Group Activities , 2011, International Journal of Computer Vision.

[2]  Wen Gao,et al.  Mining Layered Grammar Rules for Action Recognition , 2011, International Journal of Computer Vision.

[3]  Michal Irani,et al.  Detecting Irregularities in Images and in Video , 2005, Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1.

[4]  Jake K. Aggarwal,et al.  Spatio-temporal relationship match: Video structure comparison for recognition of complex human activities , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[5]  Hanning Zhou,et al.  Unusual Event Detection via Multi-camera Video Mining , 2006, 18th International Conference on Pattern Recognition (ICPR'06).

[6]  Rama Chellappa,et al.  Machine Recognition of Human Activities: A Survey , 2008, IEEE Transactions on Circuits and Systems for Video Technology.

[7]  Serge J. Belongie,et al.  Behavior recognition via sparse spatio-temporal features , 2005, 2005 IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance.

[8]  Sukhendu Das,et al.  Trajectory representation using Gabor features for motion-based video retrieval , 2009, Pattern Recognit. Lett..

[9]  Chih-Jen Lin,et al.  LIBSVM: A library for support vector machines , 2011, TIST.

[10]  Thomas Serre,et al.  A Biologically Inspired System for Action Recognition , 2007, 2007 IEEE 11th International Conference on Computer Vision.

[11]  Rama Chellappa,et al.  Activity Modeling Using Event Probability Sequences , 2008, IEEE Transactions on Image Processing.

[12]  Christian Wolf,et al.  Recognizing and Localizing Individual Activities through Graph Matching , 2010, 2010 7th IEEE International Conference on Advanced Video and Signal Based Surveillance.

[13]  Greg Mori,et al.  Action recognition by learning mid-level motion features , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[14]  Rémi Ronfard,et al.  A survey of vision-based methods for action representation, segmentation and recognition , 2011, Comput. Vis. Image Underst..

[15]  Deepu Rajan,et al.  Embedding Visual Words into Concept Space for Action and Scene Recognition , 2010, BMVC.

[16]  Mubarak Shah,et al.  Learning human actions via information maximization , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[17]  Liang-Tien Chia,et al.  Motion Context: A New Representation for Human Action Recognition , 2008, ECCV.

[18]  Peyman Milanfar,et al.  Action Recognition from One Example , 2011, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[19]  Li Wang,et al.  Human Action Segmentation and Recognition Using Discriminative Semi-Markov Models , 2011, International Journal of Computer Vision.

[20]  Ivan Laptev,et al.  On Space-Time Interest Points , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[21]  Barbara Caputo,et al.  Recognizing human actions: a local SVM approach , 2004, Proceedings of the 17th International Conference on Pattern Recognition, 2004. ICPR 2004..

[22]  Mubarak Shah,et al.  Recognizing human actions using multiple features , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[23]  Krystian Mikolajczyk,et al.  Action recognition with appearance-motion features and fast search trees , 2011, Comput. Vis. Image Underst..

[24]  Heekuck Oh,et al.  Neural Networks for Pattern Recognition , 1993, Adv. Comput..

[25]  Yann LeCun,et al.  Convolutional Learning of Spatio-temporal Features , 2010, ECCV.

[26]  Maja Pantic,et al.  An implicit spatiotemporal shape model for human activity localization and recognition , 2009, 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops.

[27]  W. Eric L. Grimson,et al.  Learning Patterns of Activity Using Real-Time Tracking , 2000, IEEE Trans. Pattern Anal. Mach. Intell..

[28]  Yang Song,et al.  Unsupervised Learning of Human Motion , 2003, IEEE Trans. Pattern Anal. Mach. Intell..

[29]  J.K. Aggarwal,et al.  Human activity analysis , 2011, ACM Comput. Surv..

[30]  Shaogang Gong,et al.  Activity based surveillance video content modelling , 2008, Pattern Recognit..

[31]  Juan Carlos Niebles,et al.  A Hierarchical Model of Shape and Appearance for Human Action Classification , 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.

[32]  Samy Bengio,et al.  Semi-supervised adapted HMMs for unusual event detection , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[33]  Kotagiri Ramamohanarao,et al.  Moving shape dynamics: A signal processing perspective , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[34]  Rama Chellappa,et al.  Silhouette-based gesture and action recognition via modeling trajectories on Riemannian shape manifolds , 2011, Comput. Vis. Image Underst..

[35]  Thomas G. Dietterich Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms , 1998, Neural Computation.

[36]  Rémi Ronfard,et al.  Action Recognition from Arbitrary Views using 3D Exemplars , 2007, 2007 IEEE 11th International Conference on Computer Vision.

[37]  Juan Carlos Niebles,et al.  Unsupervised Learning of Human Action Categories Using Spatial-Temporal Words , 2006, BMVC.

[38]  Ronen Basri,et al.  Actions as space-time shapes , 2005, Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1.

[39]  Martial Hebert,et al.  Efficient visual event detection using volumetric features , 2005, Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1.

[40]  Jean-Michel Jolion,et al.  Pairwise Features for Human Action Recognition , 2010, 2010 20th International Conference on Pattern Recognition.

[41]  Gabriela Csurka,et al.  Visual categorization with bags of keypoints , 2002, eccv 2004.

[42]  Krystian Mikolajczyk,et al.  Action recognition with motion-appearance vocabulary forest , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[43]  Xinghua Sun,et al.  Action recognition via local descriptors and holistic features , 2009, 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops.

[44]  Mubarak Shah,et al.  Learning semantic visual vocabularies using diffusion distance , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[45]  Cordelia Schmid,et al.  Learning realistic human actions from movies , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[46]  G LoweDavid,et al.  Distinctive Image Features from Scale-Invariant Keypoints , 2004 .

[47]  S. Gong,et al.  Recognising action as clouds of space-time interest points , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[48]  Frédéric Jurie,et al.  Fast Discriminative Visual Codebooks using Randomized Clustering Forests , 2006, NIPS.

[49]  Eli Shechtman,et al.  Space-time behavior based correlation , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[50]  Andrew Gilbert,et al.  Action Recognition Using Mined Hierarchical Compound Features , 2011, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[51]  Shaogang Gong,et al.  Incremental and adaptive abnormal behaviour detection , 2008, Comput. Vis. Image Underst..

[52]  Christian Wolf,et al.  Sequential Deep Learning for Human Action Recognition , 2011, HBU.