Attention-Based Extraction of Structured Information from Street View Imagery

We present a neural network model — based on Convolutional Neural Networks, Recurrent Neural Networks and a novel attention mechanism — which achieves 84.2% accuracy on the challenging French Street Name Signs (FSNS) dataset, significantly outperforming the previous state of the art (Smith'16), which achieved 72.46%. Furthermore, our new method is much simpler and more general than the previous approach. To demonstrate the generality of our model, we show that it also performs well on an even more challenging dataset derived from Google Street View, in which the goal is to extract business names from store fronts. Finally, we study the speed/accuracy tradeoff that results from using CNN feature extractors of different depths. Surprisingly, we find that deeper is not always better (in terms of accuracy, as well as speed). Our resulting model is simple, accurate and fast, allowing it to be used at scale on a variety of challenging real-world text extraction problems.

[1]  Boris Polyak,et al.  Acceleration of stochastic approximation by averaging , 1992 .

[2]  Jürgen Schmidhuber,et al.  Long Short-Term Memory , 1997, Neural Computation.

[3]  Jürgen Schmidhuber,et al.  Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks , 2006, ICML.

[4]  Kai Wang,et al.  End-to-end scene text recognition , 2011, 2011 International Conference on Computer Vision.

[5]  Yoshua Bengio,et al.  How transferable are features in deep neural networks? , 2014, NIPS.

[6]  Quoc V. Le,et al.  Sequence to Sequence Learning with Neural Networks , 2014, NIPS.

[7]  Andrew Zisserman,et al.  Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps , 2013, ICLR.

[8]  Christian Szegedy,et al.  Large Scale Business Discovery from Street Level Imagery , 2015, ArXiv.

[9]  Ernest Valveny,et al.  ICDAR 2015 competition on Robust Reading , 2015, 2015 13th International Conference on Document Analysis and Recognition (ICDAR).

[10]  Andrew Zisserman,et al.  Reading Text in the Wild with Convolutional Neural Networks , 2014, International Journal of Computer Vision.

[11]  Yoshua Bengio,et al.  Attention-Based Models for Speech Recognition , 2015, NIPS.

[12]  Samy Bengio,et al.  Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks , 2015, NIPS.

[13]  Michael S. Bernstein,et al.  ImageNet Large Scale Visual Recognition Challenge , 2014, International Journal of Computer Vision.

[14]  Takahiro Matsuda,et al.  Downtown Osaka Scene Text Dataset , 2016, ECCV Workshops.

[15]  Dale Schuurmans,et al.  Reward Augmented Maximum Likelihood for Neural Structured Prediction , 2016, NIPS.

[16]  Sergey Ioffe,et al.  Rethinking the Inception Architecture for Computer Vision , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[17]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[18]  Théodore Bluche,et al.  Joint Line Segmentation and Transcription for End-to-End Handwritten Paragraph Recognition , 2016, NIPS.

[19]  Ranjith Unnikrishnan,et al.  End-to-End Interpretation of the French Street Name Signs Dataset , 2016, ECCV Workshops.

[20]  Eugenio Culurciello,et al.  An Analysis of Deep Neural Network Models for Practical Applications , 2016, ArXiv.

[21]  Simon Osindero,et al.  Recursive Recurrent Nets with Attention Modeling for OCR in the Wild , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[22]  Jiri Matas,et al.  COCO-Text: Dataset and Benchmark for Text Detection and Recognition in Natural Images , 2016, ArXiv.

[23]  Weilin Huang,et al.  Text-Attentional Convolutional Neural Network for Scene Text Detection , 2015, IEEE Transactions on Image Processing.

[24]  Xiang Bai,et al.  Robust Scene Text Recognition with Automatic Rectification , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[25]  Xiang Bai,et al.  An End-to-End Trainable Neural Network for Image-Based Sequence Recognition and Its Application to Scene Text Recognition , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[26]  Sergey Ioffe,et al.  Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning , 2016, AAAI.

[27]  Jérôme Louradour,et al.  Scan, Attend and Read: End-to-End Handwritten Paragraph Recognition with MDLSTM Attention , 2016, 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR).