Dynamic Scene Recognition with Complementary Spatiotemporal Features

This paper presents Dynamically Pooled Complementary Features (DPCF), a unified approach to dynamic scene recognition that analyzes a short video clip in terms of its spatial, temporal and color properties. The complementarity of these properties is preserved through all main processing steps: primitive feature extraction, coding and pooling. In the feature extraction step, spatial orientations capture static appearance, spatiotemporal oriented energies capture image dynamics and color statistics capture chromatic information. The primitive features are then encoded into a mid-level representation that has been learned for the task of dynamic scene recognition. Finally, a novel dynamic spacetime pyramid is introduced. Guided by pooling energies, this dynamic pooling approach adapts to the temporal structure of the clip and can therefore handle both global and local motion. The resulting system provides online recognition of dynamic scenes; it is thoroughly evaluated on the two current benchmark datasets and yields the best results to date on both. In-depth analysis reveals the benefits of explicitly modeling feature complementarity in combination with the dynamic spacetime pyramid, indicating that this unified approach should be well-suited to many areas of video analysis.
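To make the idea of energy-guided pooling concrete, the following is a minimal sketch rather than the method of the paper. It assumes the encoded features are nonnegative codes, one row per spacetime sample, and that each sample carries a hypothetical weight in [0, 1] indicating how dynamic it is. The function name `dynamic_pool`, the choice of max pooling and the weighting scheme are all illustrative assumptions; the abstract does not specify them, and the pyramid structure is omitted.

```python
import numpy as np


def dynamic_pool(codes, dynamic_energy):
    """Pool encoded features into a 'static' and a 'dynamic' channel.

    codes:          (n_points, n_codewords) nonnegative mid-level codes.
    dynamic_energy: (n_points,) hypothetical weights in [0, 1], where
                    1 marks a strongly moving sample (an assumption of
                    this sketch, not a quantity defined in the abstract).
    Returns a vector of length 2 * n_codewords: static part, then dynamic.
    """
    codes = np.asarray(codes, dtype=float)
    w = np.clip(np.asarray(dynamic_energy, dtype=float), 0.0, 1.0)
    if codes.ndim != 2 or codes.shape[0] == 0:
        raise ValueError("codes must be a non-empty 2-D array")
    if w.shape != (codes.shape[0],):
        raise ValueError("need exactly one energy weight per row of codes")

    # Samples with high dynamic energy dominate the dynamic channel ...
    dynamic = (codes * w[:, None]).max(axis=0)
    # ... and samples with low dynamic energy dominate the static channel.
    static = (codes * (1.0 - w)[:, None]).max(axis=0)
    return np.concatenate([static, dynamic])
```

With two samples, one fully static and one fully dynamic, each codeword response ends up in the channel that matches its sample's energy, so static and moving content are aggregated separately instead of being mixed in a single pooled vector.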
