Temporal Residual Networks for Dynamic Scene Recognition

This paper combines three contributions to establish a new state-of-the-art in dynamic scene recognition. First, we present a novel ConvNet architecture based on temporal residual units that is fully convolutional in spacetime. Our model augments spatial ResNets with convolutions across time to hierarchically add temporal residuals as the depth of the network increases. Second, existing approaches to video-based recognition are categorized and a baseline of seven previously top performing algorithms is selected for comparative evaluation on dynamic scenes. Third, we introduce a new and challenging video database of dynamic scenes that more than doubles the size of those previously available. This dataset is explicitly split into two subsets of equal size that contain videos with and without camera motion to allow for systematic study of how this variable interacts with the defining dynamics of the scene per se. Our evaluations verify the particular strengths and weaknesses of the baseline algorithms with respect to various scene classes and camera motion parameters. Finally, our temporal ResNet boosts recognition performance and establishes a new state-of-the-art on dynamic scene recognition, as well as on the complementary task of action recognition.

[1]  Fei-Fei Li,et al.  Large-Scale Video Classification with Convolutional Neural Networks , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[2]  Subhransu Maji,et al.  Deep filter banks for texture recognition and segmentation , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[3]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[4]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[5]  Anil K. Jain,et al.  Image classification for content-based indexing , 2001, IEEE Trans. Image Process..

[6]  Shmuel Peleg,et al.  Compact CNN for indexing egocentric videos , 2015, 2016 IEEE Winter Conference on Applications of Computer Vision (WACV).

[7]  Richard P. Wildes,et al.  Bags of Spacetime Energies for Dynamic Scene Recognition , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[8]  Qiang Chen,et al.  Network In Network , 2013, ICLR.

[9]  TorralbaA.,et al.  Using the forest to see the trees , 2010 .

[10]  Yann LeCun,et al.  Convolutional Learning of Spatio-temporal Features , 2010, ECCV.

[11]  Matthew J. Hausknecht,et al.  Beyond short snippets: Deep networks for video classification , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[12]  Richard P. Wildes,et al.  Spacetime Forests with Complementary Features for Dynamic Scene Recognition , 2013, BMVC.

[13]  Richard P. Wildes,et al.  Spatiotemporal Residual Networks for Video Action Recognition , 2016, NIPS.

[14]  Kavita Bala,et al.  Inside-Outside Net: Detecting Objects in Context with Skip Pooling and Recurrent Neural Networks , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[15]  Cordelia Schmid,et al.  Learning realistic human actions from movies , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[16]  Sergey Ioffe,et al.  Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift , 2015, ICML.

[17]  Andrea Vedaldi,et al.  MatConvNet: Convolutional Neural Networks for MATLAB , 2014, ACM Multimedia.

[18]  Sergey Ioffe,et al.  Rethinking the Inception Architecture for Computer Vision , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[19]  A. Smeaton,et al.  TRECVID 2013 -- An Overview of the Goals, Tasks, Data, Evaluation Mechanisms, and Metrics | NIST , 2011 .

[20]  Antonio Torralba,et al.  Recognizing indoor scenes , 2009, CVPR.

[21]  Cordelia Schmid,et al.  Action Recognition with Improved Trajectories , 2013, 2013 IEEE International Conference on Computer Vision.

[22]  Richard P. Wildes,et al.  Dynamic scene understanding: The role of orientation features in space and time in scene classification , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[23]  Stefano Soatto,et al.  Dynamic Textures , 2003, International Journal of Computer Vision.

[24]  Bolei Zhou,et al.  Object Detectors Emerge in Deep Scene CNNs , 2014, ICLR.

[25]  Jian Sun,et al.  Identity Mappings in Deep Residual Networks , 2016, ECCV.

[26]  Cordelia Schmid,et al.  Actions in context , 2009, CVPR.

[27]  Christopher Joseph Pal,et al.  Delving Deeper into Convolutional Networks for Learning Video Representations , 2015, ICLR.

[28]  Andrew Zisserman,et al.  Two-Stream Convolutional Networks for Action Recognition in Videos , 2014, NIPS.

[29]  Matthieu Cord,et al.  Dynamic Scene Classification: Learning Motion Descriptors with Slow Features Analysis , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[30]  Deva Ramanan,et al.  Exploring Weak Stabilization for Motion Feature Extraction , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[31]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[32]  Lorenzo Torresani,et al.  Learning Spatiotemporal Features with 3D Convolutional Networks , 2014, 2015 IEEE International Conference on Computer Vision (ICCV).

[33]  Andrew Zisserman,et al.  Convolutional Two-Stream Network Fusion for Video Action Recognition , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[34]  Mubarak Shah,et al.  UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild , 2012, ArXiv.

[35]  Sergey Ioffe,et al.  Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning , 2016, AAAI.

[36]  Georges Quénot,et al.  TRECVID 2015 - An Overview of the Goals, Tasks, Data, Evaluation Mechanisms and Metrics , 2011, TRECVID.

[37]  Cordelia Schmid,et al.  Dense Trajectories and Motion Boundary Descriptors for Action Recognition , 2013, International Journal of Computer Vision.

[38]  Thomas Serre,et al.  HMDB: A large video database for human motion recognition , 2011, 2011 International Conference on Computer Vision.

[39]  Yi Yang,et al.  A discriminative CNN video representation for event detection , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[40]  Antonio Torralba,et al.  Using the forest to see the trees: exploiting context for visual object detection and localization , 2010, CACM.

[41]  Antonio Torralba,et al.  Modeling the Shape of the Scene: A Holistic Representation of the Spatial Envelope , 2001, International Journal of Computer Vision.

[42]  Rama Chellappa,et al.  Moving vistas: Exploiting motion for describing scenes , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[43]  Trevor Darrell,et al.  Long-term recurrent convolutional networks for visual recognition and description , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[44]  Dumitru Erhan,et al.  Going deeper with convolutions , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[45]  Mary C. Potter,et al.  Recognition and Memory for Briefly Presented Scenes , 2012, Front. Psychology.

[46]  Hongbin Zha,et al.  Structure-Sensitive Superpixels via Geodesic Distance , 2011, 2011 International Conference on Computer Vision.

[47]  Pietro Perona,et al.  A Bayesian hierarchical model for learning natural scene categories , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[48]  Richard P. Wildes,et al.  Dynamic Scene Recognition with Complementary Spatiotemporal Features , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[49]  Nitish Srivastava,et al.  Unsupervised Learning of Video Representations using LSTMs , 2015, ICML.

[50]  Ali Farhadi,et al.  Actions ~ Transformations , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[51]  Cordelia Schmid,et al.  Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).