SA-CNN: Dynamic Scene Classification using Convolutional Neural Networks

Classifying videos of natural dynamic scenes into appropriate classes has gained a lot of attention in recent years. The problem becomes especially challenging when the camera used to capture the video is itself in motion. In this paper, we analyse the performance of statistical aggregation (SA) techniques on various pre-trained convolutional neural network (CNN) models to address this problem. The proposed approach extracts CNN activation features for a number of frames in a video and then applies an aggregation scheme to obtain a robust feature descriptor for the video. The final descriptor is powerful enough to distinguish among dynamic scenes and can even handle the scenario where camera motion is dominant and the scene dynamics are complex. Further, this paper presents an extensive study of the performance of various aggregation methods and their combinations. We compare the proposed approach with other dynamic scene classification algorithms on two publicly available datasets, Maryland and YUPenn, and show that it performs better than the state of the art on both.
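The aggregation step described above can be illustrated with a minimal sketch. It is not the paper's implementation: the function name aggregate_frame_features, the choice of statistics (mean, max, min, standard deviation), the 4096-dimensional feature size, and the random array standing in for real CNN activations are all assumptions made for illustration, since the abstract does not specify them.

```python
import numpy as np

# Hypothetical helper: the statistics offered here are illustrative
# assumptions, not the specific aggregation schemes studied in the paper.
_REDUCERS = {
    "mean": np.mean,
    "max": np.max,
    "min": np.min,
    "std": np.std,
}


def aggregate_frame_features(frame_features, stats=("mean", "max")):
    """Collapse per-frame features into a single video descriptor.

    frame_features: array-like of shape (num_frames, feature_dim), one row
        of CNN activations per sampled frame.
    stats: names of the statistics to compute over the frame axis; their
        outputs are concatenated in the given order.
    """
    features = np.asarray(frame_features, dtype=np.float64)
    if features.ndim != 2 or features.shape[0] == 0:
        raise ValueError("expected a non-empty (num_frames, feature_dim) array")
    unknown = [name for name in stats if name not in _REDUCERS]
    if unknown:
        raise ValueError(f"unknown statistics: {unknown}")
    # Each statistic reduces over frames (axis 0), leaving feature_dim values.
    parts = [_REDUCERS[name](features, axis=0) for name in stats]
    return np.concatenate(parts)


if __name__ == "__main__":
    # Stand-in for activations of 30 sampled frames (assumed 4096-D layer).
    rng = np.random.default_rng(0)
    fake_activations = rng.random((30, 4096))
    descriptor = aggregate_frame_features(fake_activations, stats=("mean", "max"))
    print(descriptor.shape)  # (8192,)
```

Concatenating several statistics is one simple way to combine aggregation methods; the resulting fixed-length descriptor could then be passed to any standard classifier regardless of how many frames were sampled.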
