Improving bag-of-features action recognition with non-local cues

Local space-time features have recently shown promising results within Bag-of-Features (BoF) approach to action recognition in video. Pure local features and descriptors, however, provide only limited discriminative power implying ambiguity among features and sub-optimal classification performance. In this work, we propose to disambiguate local space-time features and to improve action recognition by integrating additional nonlocal cues with BoF representation. For this purpose, we decompose video into region classes and augment local features with corresponding region-class labels. In particular, we investigate unsupervised and supervised video segmentation using (i) motion-based foreground segmentation, (ii) person detection, (iii) static action detection and (iv) object detection. While such segmentation methods might be imperfect, they provide complementary region-level information to local features. We demonstrate how this information can be integrated with BoF representations in a kernel-combination framework. We evaluate our method on the recent and challenging Hollywood-2 action dataset and demonstrate significant improvements.

[1]  David A. McAllester,et al.  Object Detection with Discriminatively Trained Part Based Models , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[2]  Juan Carlos Niebles,et al.  Unsupervised Learning of Human Action Categories Using Spatial-Temporal Words , 2006, BMVC.

[3]  Sebastian Nowozin,et al.  On feature combination for multiclass object classification , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[4]  Cordelia Schmid,et al.  Local Features and Kernels for Classification of Texture and Object Categories: A Comprehensive Study , 2006, CVPR Workshops.

[5]  Cordelia Schmid,et al.  Learning realistic human actions from movies , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[6]  Chih-Jen Lin,et al.  LIBSVM: A library for support vector machines , 2011, TIST.

[7]  Hans Jørgen Andersen,et al.  British Machine Vision Conference 2006 , 2006 .

[8]  Cordelia Schmid,et al.  Actions in context , 2009, CVPR.

[9]  Serge J. Belongie,et al.  Behavior recognition via sparse spatio-temporal features , 2005, 2005 IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance.

[10]  Stefano Soatto,et al.  Localizing Objects with Smart Dictionaries , 2008, ECCV.

[11]  Barbara Caputo,et al.  Recognizing human actions: a local SVM approach , 2004, Proceedings of the 17th International Conference on Pattern Recognition, 2004. ICPR 2004..

[12]  Cordelia Schmid,et al.  Actions in context , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[13]  Andrew Zisserman,et al.  Video Google: a text retrieval approach to object matching in videos , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[14]  Cordelia Schmid,et al.  A sparse texture representation using local affine regions , 2005, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[15]  David Nistér,et al.  Scalable Recognition with a Vocabulary Tree , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[16]  Frédéric Jurie,et al.  Randomized Clustering Forests for Image Classification , 2008, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[17]  Gabriela Csurka,et al.  Visual categorization with bags of keypoints , 2002, eccv 2004.

[18]  Ivan Laptev,et al.  On Space-Time Interest Points , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[19]  Cordelia Schmid,et al.  Local Features and Kernels for Classification of Texture and Object Categories: A Comprehensive Study , 2006, 2006 Conference on Computer Vision and Pattern Recognition Workshop (CVPRW'06).

[20]  Cordelia Schmid,et al.  Evaluation of Local Spatio-temporal Features for Action Recognition , 2009, BMVC.

[21]  Jean-Marc Odobez,et al.  Robust Multiresolution Estimation of Parametric Motion Models , 1995, J. Vis. Commun. Image Represent..

[22]  Vincent Lepetit,et al.  Randomized trees for real-time keypoint recognition , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[23]  Paul A. Viola,et al.  Rapid object detection using a boosted cascade of simple features , 2001, Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. CVPR 2001.