Learning to combine foveal glimpses with a third-order Boltzmann machine

We describe a model based on a Boltzmann machine with third-order connections that can learn how to accumulate information about a shape over several fixations. The model uses a retina that has only enough high-resolution pixels to cover a small area of the image, so it must decide on a sequence of fixations, and it must combine the "glimpse" at each fixation with the location of that fixation before integrating it with the information from other glimpses of the same object. We evaluate this model on a synthetic dataset and two image classification datasets, showing that it can perform at least as well as a model trained on whole images.
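The abstract does not give the exact form of the third-order connections. The sketch below assumes one common choice: a factored three-way interaction between the glimpse pixels x, a one-hot fixation position z, and binary hidden units h, with energy E = -sum_f (sum_i x_i Wx[i][f]) (sum_j z_j Wz[j][f]) (sum_k h_k Wh[k][f]) (biases omitted). Evidence from several glimpses is accumulated by summing each glimpse's input to the hidden units before the sigmoid. The names Wx, Wz, Wh, gated_factors, hidden_input and accumulate are hypothetical, and no learning is shown; this illustrates how the fixation location multiplicatively gates what a glimpse contributes, not the authors' exact model or training procedure.

```python
import math
import random


def gated_factors(glimpse, loc, Wx, Wz):
    """Per-factor term (sum_i x_i * Wx[i][f]) * Wz[loc][f], with z one-hot at loc."""
    n_factors = len(Wz[0])
    return [
        sum(x * Wx[i][f] for i, x in enumerate(glimpse)) * Wz[loc][f]
        for f in range(n_factors)
    ]


def hidden_input(glimpse, loc, Wx, Wz, Wh):
    """Input each hidden unit k receives from one glimpse taken at position loc."""
    g = gated_factors(glimpse, loc, Wx, Wz)
    return [sum(w * gf for w, gf in zip(Wh[k], g)) for k in range(len(Wh))]


def energy(glimpse, loc, h, Wx, Wz, Wh):
    """Hypothetical three-way energy for one glimpse (biases omitted).

    -dE/dh_k equals hidden_input(...)[k], so the two functions are consistent.
    """
    g = gated_factors(glimpse, loc, Wx, Wz)
    return -sum(
        gf * sum(hk * Wh[k][f] for k, hk in enumerate(h)) for f, gf in enumerate(g)
    )


def accumulate(glimpses, Wx, Wz, Wh, bias):
    """Sum hidden inputs over (glimpse, loc) pairs, then return sigmoid probabilities."""
    total = list(bias)
    for glimpse, loc in glimpses:
        for k, a in enumerate(hidden_input(glimpse, loc, Wx, Wz, Wh)):
            total[k] += a
    return [1.0 / (1.0 + math.exp(-a)) for a in total]


def rand_matrix(rows, cols, rng, scale=0.1):
    """Small Gaussian weights, purely for illustration."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


if __name__ == "__main__":
    rng = random.Random(0)
    D, L, F, K = 4, 3, 5, 2  # glimpse pixels, fixation positions, factors, hidden units
    Wx, Wz, Wh = rand_matrix(D, F, rng), rand_matrix(L, F, rng), rand_matrix(K, F, rng)
    glimpses = [([1.0, 0.0, 0.5, 0.2], 0), ([0.3, 0.9, 0.0, 0.1], 2)]
    print(accumulate(glimpses, Wx, Wz, Wh, bias=[0.0] * K))
```

Because the per-glimpse contributions are simply added in this sketch, the hidden probabilities do not depend on the order of fixations, and nothing here decides where to look next; choosing the fixation sequence would require a separate mechanism.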
