Deep Insights into Convolutional Networks for Video Recognition

As the success of deep models has led to their deployment in all areas of computer vision, it is increasingly important to understand how these representations work and what they are capturing. In this paper, we shed light on deep spatiotemporal representations by visualizing the internal representations of models that have been trained to recognize actions in video. We visualize multiple two-stream architectures to show that local detectors for appearance and motion arise and combine to form distributed representations for recognizing human actions. Key observations include the following. First, cross-stream fusion enables the learning of true spatiotemporal features rather than simply separate appearance and motion features. Second, the networks can learn local representations that are highly class-specific, but also generic representations that serve a range of classes. Third, throughout the hierarchy of the network, features become more abstract and show increasing invariance to aspects of the data that are unimportant to the desired distinctions (e.g., motion patterns across various speeds). Fourth, visualizations can be used not only to shed light on learned representations, but also to reveal idiosyncrasies of training data and to explain failure cases of the system.
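
Visualizations of this kind are typically produced with activation maximization: gradient ascent on the input to synthesize the preferred stimulus of a chosen unit, regularized so that the result remains interpretable. The sketch below illustrates the idea for a single appearance stream; the pretrained ResNet-50 stand-in, the chosen layer and channel, the step count, and the total-variation weight are all illustrative assumptions, not the paper's exact procedure (which optimizes appearance and optical-flow inputs of two-stream models jointly).

```python
# Minimal activation-maximization sketch (assumptions: model, layer, channel,
# and regularization weight are illustrative, not the authors' settings).
import torch
import torchvision.models as models

# Stand-in for a trained appearance stream; gradients flow only to the input.
model = models.resnet50(weights="IMAGENET1K_V1").eval()
for p in model.parameters():
    p.requires_grad_(False)

# Capture activations of an intermediate layer with a forward hook.
acts = {}
def hook(module, inputs, output):
    acts["feat"] = output
model.layer3.register_forward_hook(hook)

x = torch.randn(1, 3, 224, 224, requires_grad=True)  # start from noise
opt = torch.optim.Adam([x], lr=0.05)

unit = 7  # channel whose preferred input we synthesize (arbitrary choice)
for _ in range(200):
    opt.zero_grad()
    model(x)
    # Maximize the mean activation of one channel; a total-variation prior
    # penalizes high-frequency noise so the image stays interpretable.
    act = acts["feat"][0, unit].mean()
    tv = (x[..., 1:, :] - x[..., :-1, :]).abs().mean() \
       + (x[..., :, 1:] - x[..., :, :-1]).abs().mean()
    loss = -act + 0.1 * tv
    loss.backward()
    opt.step()

# x now approximates the unit's preferred input (clamp/normalize to display).
```

In the spatiotemporal setting, the same optimization is run over a stack of appearance frames and flow fields, which is what exposes whether a unit responds to appearance, to motion, or to a fused combination of the two.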
