Visual Representations: Defining Properties and Deep Approximations

Visual representations are defined as minimal sufficient statistics of visual data, for a class of tasks, that are also invariant to nuisance variability. Minimal sufficiency guarantees that the representation can be stored in lieu of the raw data with the smallest complexity and no loss of performance on the task at hand. Invariance guarantees that the statistic is constant with respect to uninformative transformations of the data. We derive analytical expressions for such representations and show that they are related to feature descriptors commonly used in computer vision, as well as to convolutional neural networks. This link highlights the assumptions and approximations tacitly made by these methods and explains empirical practices such as clamping, pooling and joint normalization.
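
As a rough illustration of the practices named above, the following toy sketch (not the paper's derivation) uses a 1-D signal with circular shifts standing in for the nuisance group. The helper names `pooled_response` and `normalize_clamp` are hypothetical, and the clamping threshold of 0.2 is an assumption borrowed from the standard SIFT descriptor. Pooling rectified filter responses over all shifts gives a statistic that does not change when the input is shifted, and joint normalization followed by clamping removes the dependence on a global contrast scale.

```python
import numpy as np

def pooled_response(signal, template):
    """Average the rectified response of `template` over every circular shift of `signal`.

    Averaging over the whole shift group makes the result invariant to shifts.
    """
    n = len(template)
    responses = []
    for s in range(len(signal)):
        shifted = np.roll(signal, s)
        # Rectified correlation of the template with the start of the shifted signal.
        responses.append(max(float(np.dot(template, shifted[:n])), 0.0))
    return float(np.mean(responses))

def normalize_clamp(descriptor, threshold=0.2, eps=1e-12):
    """Jointly L2-normalize, clamp large entries, then renormalize (SIFT-style)."""
    d = np.asarray(descriptor, dtype=float)
    d = d / (np.linalg.norm(d) + eps)   # joint normalization: removes global scale
    d = np.minimum(d, threshold)        # clamping: limits the influence of dominant bins
    return d / (np.linalg.norm(d) + eps)

rng = np.random.default_rng(0)
signal = rng.normal(size=32)
template = rng.normal(size=5)

a = pooled_response(signal, template)
b = pooled_response(np.roll(signal, 7), template)  # same signal, shifted by a nuisance

d = np.array([10.0, 1.0, 1.0, 1.0])               # toy histogram with one dominant bin
out = normalize_clamp(d)
out_scaled = normalize_clamp(3.0 * d)             # same histogram under a contrast change
```

Here `a` and `b` agree up to floating-point rounding because the set of circular shifts of a shifted signal is the same set, and `out` and `out_scaled` agree because the first normalization cancels the scale factor before clamping is applied.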
