Visual Scene Representations: Sufficiency, Minimality, Invariance and Deep Approximations

Visual representations are functions of visual data that are minimal sufficient statistics for a class of tasks and maximally invariant to nuisance variability. Minimal sufficiency guarantees that we can store the statistic in lieu of the raw data with no performance loss and smallest complexity. Maximal invariance guarantees that the statistic is constant with respect to unwanted transformations of the data, and nothing else. We derive an expression for such representations and show that, under certain restrictive assumptions, they are related to “feature descriptors” commonly in use in the computer vision community, as well as to increasingly popular convolutional architectures. This link highlights the conditions tacitly assumed by these descriptors and networks, under which they can be expected to perform well, and also suggests ways to improve and generalize them, by lifting such assumptions. This new interpretation draws connections to the classical theories of sampling, hypothesis testing and group invariance.
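
The properties named in the abstract can be stated in standard textbook form. The sketch below is one way to write them. The symbols are assumptions introduced here for illustration and may differ from the paper's own notation: $x$ is the data, $\theta$ the task variable, $\phi$ the representation, and $G$ a group of nuisance transformations acting on the data.

```latex
% Assumed notation (not taken from the paper): data x, task variable \theta,
% representation \phi, nuisance group G acting on x.
\begin{align*}
  \text{Sufficiency:} \quad
    & p(\theta \mid \phi(x)) = p(\theta \mid x) \\
  \text{Minimality:} \quad
    & \text{for every sufficient } \psi \text{ there is a map } f
      \text{ with } \phi = f \circ \psi \\
  \text{Invariance:} \quad
    & \phi(g\,x) = \phi(x) \quad \text{for all } g \in G \\
  \text{Maximality:} \quad
    & \phi(x) = \phi(x') \;\Longrightarrow\; x' = g\,x
      \text{ for some } g \in G
\end{align*}
```

Read together, the last two lines say that $\phi$ is constant on each orbit of $G$ and takes distinct values on distinct orbits, which is the "and nothing else" clause of the abstract.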