Learning Generative Models with Visual Attention

Attention has long been proposed by psychologists as a mechanism for efficiently dealing with the massive amount of sensory stimuli reaching the neocortex. Inspired by attention models in visual neuroscience and by the need for object-centered data in generative modeling, we propose a deep-learning-based generative framework that uses attention. The attentional mechanism propagates signals from the region of interest in a scene to an aligned canonical representation used for generative modeling. By ignoring background clutter in the scene, the generative model can concentrate its resources on the object of interest. A convolutional neural network provides good initializations for posterior inference, which is carried out with Hamiltonian Monte Carlo. After learning on images of faces, our model robustly attends to the face region of novel test subjects. More importantly, it can learn generative models of new faces from a novel dataset of large images in which the face locations are not known.
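To make the "aligned canonical representation" concrete, the following is a minimal sketch, not the authors' code, of the attentional read step the abstract describes: given hypothesized attention parameters (center, scale, rotation), the attended region of a scene is resampled into a small, aligned window on which a generative model could be trained. The function name, the parameterization as a similarity transform, and the use of bilinear sampling are illustrative assumptions.

import numpy as np

def read_canonical_window(image, center, scale, angle, out_size=24):
    """Warp the attended region of a grayscale image (H x W) into an
    out_size x out_size canonical window via a similarity transform.
    `center` is (x, y) in pixels; `scale` is the half-width of the
    attended region in pixels; `angle` is the rotation in radians."""
    h, w = image.shape
    # Canonical grid coordinates in [-1, 1] x [-1, 1].
    lin = np.linspace(-1.0, 1.0, out_size)
    gy, gx = np.meshgrid(lin, lin, indexing="ij")
    # Map canonical coordinates into image coordinates: rotate, scale, shift.
    cos_a, sin_a = np.cos(angle), np.sin(angle)
    xs = center[0] + scale * (cos_a * gx - sin_a * gy)
    ys = center[1] + scale * (sin_a * gx + cos_a * gy)
    # Bilinear interpolation, clamped at the image border.
    x0 = np.clip(np.floor(xs).astype(int), 0, w - 2)
    y0 = np.clip(np.floor(ys).astype(int), 0, h - 2)
    dx = np.clip(xs - x0, 0.0, 1.0)
    dy = np.clip(ys - y0, 0.0, 1.0)
    top = (1 - dx) * image[y0, x0] + dx * image[y0, x0 + 1]
    bot = (1 - dx) * image[y0 + 1, x0] + dx * image[y0 + 1, x0 + 1]
    return (1 - dy) * top + dy * bot

# Toy usage: attend to the center of a random 128 x 128 "scene".
scene = np.random.rand(128, 128)
window = read_canonical_window(scene, center=(64.0, 64.0), scale=20.0, angle=0.1)
print(window.shape)  # (24, 24)

In the framework described above, these attention parameters would be inferred per image (with a convolutional network supplying the initialization), rather than fixed by hand as in this toy example.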
