Video segmentation and feature co-occurrences for activity classification

The Bag-of-Words scheme has become almost de rigueur for event recognition tasks because of its robustness and simplicity. Despite its effectiveness, this technique discards the spatial and temporal relationships between codewords. This paper tackles the problem of building a video codeword representation that captures such relationships. We propose a new method that exploits spatio-temporal boundaries and discriminative codeword co-occurrences. Given a set of videos and their corresponding quantized features, each video is first decomposed into spatio-temporal volumes by a multi-scale video segmentation algorithm. Meaningful codeword co-occurrences are then extracted within each volume, and each video is represented by a histogram of co-occurring features. The set of histograms is finally fed to an SVM for classification. Evaluation on the realistic TRECVID MED11 challenge database validates the approach.
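
To make the representation concrete, the following is a minimal sketch of the histogram step only, under simplifying assumptions that are not taken from the paper: each quantized feature is given as a hypothetical `(codeword_id, volume_id)` pair, a co-occurrence is any unordered pair of distinct codewords found in the same volume, and the list of pairs to keep (`vocabulary_pairs`) is supplied by hand rather than selected for discriminative power. The segmentation and the SVM are left out.

```python
from collections import Counter, defaultdict
from itertools import combinations


def cooccurrence_histogram(features, vocabulary_pairs):
    """Build an L1-normalised histogram of codeword pairs for one video.

    features: iterable of (codeword_id, volume_id) tuples (assumed input format).
    vocabulary_pairs: ordered list of (a, b) pairs with a < b to be counted.
    """
    # Group the codewords by the spatio-temporal volume they fall in.
    words_by_volume = defaultdict(set)
    for codeword, volume in features:
        words_by_volume[volume].add(codeword)

    # Count each unordered pair once per volume that contains both codewords.
    counts = Counter()
    for words in words_by_volume.values():
        for pair in combinations(sorted(words), 2):
            counts[pair] += 1

    # Keep only the chosen pairs, in a fixed order, and normalise.
    hist = [counts[pair] for pair in vocabulary_pairs]
    total = sum(hist)
    return [h / total for h in hist] if total else [0.0] * len(hist)


# Toy video: three volumes, three codewords.
video = [(0, "v1"), (1, "v1"), (2, "v1"), (0, "v2"), (1, "v2"), (2, "v3")]
pairs = [(0, 1), (0, 2), (1, 2)]
histogram = cooccurrence_histogram(video, pairs)
print(histogram)
```

In this toy case the pair `(0, 1)` appears in two volumes and the other two pairs in one each, so the first bin carries half of the mass. One such vector per video would then be passed to a classifier; restricting the counts to volumes is what lets the segmentation boundaries shape the representation.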
