Enriched Deep Recurrent Visual Attention Model for Multiple Object Recognition

We design an Enriched Deep Recurrent Visual Attention Model (EDRAM), an improved attention-based architecture for multiple object recognition. The proposed model is a fully differentiable unit that can be optimized end-to-end with Stochastic Gradient Descent (SGD). A Spatial Transformer (ST) is employed as the visual attention mechanism, which allows the model to learn the geometric transformations of objects within images. By combining the Spatial Transformer with a powerful recurrent architecture, the proposed EDRAM can localize and recognize objects simultaneously. EDRAM has been evaluated on two publicly available datasets: MNIST Cluttered (with 70K cluttered digits) and SVHN (with up to 250K real-world images of house numbers). Experiments show that it achieves superior performance compared with state-of-the-art models.

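To make the described architecture concrete, the following is a minimal sketch (not the authors' implementation, which relied on Theano/Blocks) of one way a recurrent attention step with a Spatial Transformer can be wired together in PyTorch. At each time step a recurrent core emits affine parameters, a differentiable Spatial Transformer crops a glimpse, and the glimpse is encoded to drive both classification and the next attention step, so the whole loop is trainable end-to-end with SGD. All layer sizes and names (`glimpse_size`, `hidden_dim`, the number of steps) are assumptions for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RecurrentSTAttention(nn.Module):
    """Sketch of a recurrent visual attention step using a Spatial Transformer."""

    def __init__(self, num_classes=10, hidden_dim=256, glimpse_size=26):
        super().__init__()
        self.glimpse_size = glimpse_size
        # Encode each attended glimpse into a feature vector.
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(8), nn.Flatten(),
            nn.Linear(16 * 8 * 8, hidden_dim), nn.ReLU(),
        )
        self.rnn = nn.LSTMCell(hidden_dim, hidden_dim)
        self.classifier = nn.Linear(hidden_dim, num_classes)
        # Emit 6 affine parameters; initialize to the identity transform,
        # the usual starting point for Spatial Transformer localization nets.
        self.locator = nn.Linear(hidden_dim, 6)
        self.locator.weight.data.zero_()
        self.locator.bias.data.copy_(torch.tensor([1., 0., 0., 0., 1., 0.]))

    def forward(self, image, steps=4):
        n = image.size(0)
        h = image.new_zeros(n, self.rnn.hidden_size)
        c = torch.zeros_like(h)
        logits = []
        for _ in range(steps):
            # Predict where to look from the recurrent state.
            theta = self.locator(h).view(n, 2, 3)
            grid = F.affine_grid(
                theta,
                (n, image.size(1), self.glimpse_size, self.glimpse_size),
                align_corners=False)
            # Differentiable crop of the attended region.
            glimpse = F.grid_sample(image, grid, align_corners=False)
            # Update the recurrent state and classify the current glimpse.
            h, c = self.rnn(self.encoder(glimpse), (h, c))
            logits.append(self.classifier(h))
        return torch.stack(logits, dim=1)  # one prediction per attention step

# Usage: a batch of cluttered-MNIST-sized (100x100) grayscale images.
model = RecurrentSTAttention()
preds = model(torch.randn(8, 1, 100, 100))  # shape (8, steps, num_classes)
```

Because the Spatial Transformer's sampling is differentiable, gradients flow from the per-step classification loss back through the glimpse into the affine parameters, which is what allows localization and recognition to be learned jointly without reinforcement-learning-style estimators.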