Understanding deep image representations by inverting them

Image representations, from SIFT and Bag of Visual Words to Convolutional Neural Networks (CNNs), are a crucial component of almost any image understanding system. Nevertheless, our understanding of them remains limited. In this paper we conduct a direct analysis of the visual information contained in representations by asking the following question: given an encoding of an image, to what extent is it possible to reconstruct the image itself? To answer this question we contribute a general framework for inverting representations. We show that this method can invert representations such as HOG more accurately than recent alternatives while also being applicable to CNNs. We then use this technique to study, for the first time, the inverses of recent state-of-the-art CNN image representations. Among our findings, we show that several layers in CNNs retain photographically accurate information about the image, with varying degrees of geometric and photometric invariance.
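The inversion framework described above amounts to searching for an image whose representation matches a target code, i.e. minimising a reconstruction loss plus a regulariser by gradient descent. The following is a minimal sketch of that idea on a toy "representation" (a fixed random linear map followed by a tanh nonlinearity, standing in for a descriptor or CNN layer); the projection `W`, the loss weights, and the step sizes are all illustrative assumptions, not the paper's actual architecture or image prior.

```python
import numpy as np

# Toy differentiable "representation" phi: a fixed random linear
# projection followed by tanh, standing in for a feature extractor.
# (W, the nonlinearity, and all hyperparameters are hypothetical.)
rng = np.random.default_rng(0)
W = rng.standard_normal((64, 256)) / 16.0  # 256-pixel "image" -> 64-dim code

def phi(x):
    return np.tanh(W @ x)

def invert(target_code, steps=2000, lr=0.2, lam=1e-4):
    """Find x minimising ||phi(x) - target_code||^2 + lam * ||x||^2."""
    x = np.zeros(256)                      # start from a blank "image"
    for _ in range(steps):
        code = phi(x)
        diff = code - target_code
        # Chain rule: d/dx ||tanh(Wx) - c||^2 = 2 W^T (diff * (1 - code^2));
        # the constant factor is absorbed into the learning rate.
        grad = W.T @ (diff * (1.0 - code ** 2)) + lam * x
        x -= lr * grad
    return x

x0 = rng.standard_normal(256)              # the "original image"
c0 = phi(x0)                               # its code, the only input to inversion
x_rec = invert(c0)

# The reconstruction matches the code closely even though phi is
# many-to-one (64 constraints, 256 unknowns), so x_rec need not
# equal x0 -- this gap is exactly the invariance the paper studies.
print(np.linalg.norm(phi(x_rec) - c0) / np.linalg.norm(c0))
```

The regulariser `lam * ||x||^2` here is only a placeholder for the natural-image priors (e.g. total variation) that make real reconstructions look image-like rather than adversarial.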
