Words Matter: Scene Text for Image Classification and Retrieval

Text in natural images typically adds meaning to an object or scene. In particular, text specifies which business places serve drinks (e.g., cafe, teahouse) or food (e.g., restaurant, pizzeria), and what kind of service is provided (e.g., massage, repair). The mere presence of text, its words, and their meaning are closely related to the semantics of the object or scene. This paper exploits textual content in images for fine-grained business place classification and logo retrieval. There are four main contributions. First, we show that the textual cues extracted by the proposed method are effective for both tasks: combining the proposed textual and visual cues outperforms visual-only classification and retrieval by a large margin. Second, to extract the textual cues, a generic and fully unsupervised word box proposal method is introduced. The method reaches state-of-the-art word detection recall with a limited number of proposals. Third, contrary to what is widely acknowledged in the text detection literature, we demonstrate that high recall in word detection is more important than a high F-score, at least for the two tasks considered in this work. Last, this paper provides a large annotated text detection dataset with 10K images and 27,601 word boxes.
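To make the recall versus F-score distinction concrete, the following is a minimal sketch of how the two measures can diverge for word box detection. It assumes a simple matching rule (a box counts as matched when its intersection-over-union with some box on the other side is at least 0.5); this rule, the function names `iou` and `detection_scores`, and the toy boxes are illustrative assumptions, not the evaluation protocol used in the paper.

```python
from typing import List, Tuple

# A box is (x_min, y_min, x_max, y_max); this layout is an assumption.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def detection_scores(
    proposals: List[Box], ground_truth: List[Box], thresh: float = 0.5
) -> Tuple[float, float, float]:
    """Return (recall, precision, F-score) under a simple IoU matching rule."""
    # Ground-truth words covered by at least one proposal.
    found = sum(
        any(iou(g, p) >= thresh for p in proposals) for g in ground_truth
    )
    # Proposals that cover at least one ground-truth word.
    correct = sum(
        any(iou(p, g) >= thresh for g in ground_truth) for p in proposals
    )
    recall = found / len(ground_truth) if ground_truth else 0.0
    precision = correct / len(proposals) if proposals else 0.0
    denom = precision + recall
    f_score = 2 * precision * recall / denom if denom > 0 else 0.0
    return recall, precision, f_score


# Toy image with two annotated words.
ground_truth = [(0, 0, 10, 10), (20, 0, 30, 10)]

# A conservative detector: one correct box, one word missed.
conservative = [(0, 0, 10, 10)]

# A proposal-style detector: both words found, plus four false positives.
proposal_heavy = [
    (0, 0, 10, 10), (20, 0, 30, 10),
    (50, 50, 60, 60), (70, 70, 80, 80), (90, 90, 95, 95), (40, 0, 45, 5),
]

print(detection_scores(conservative, ground_truth))
print(detection_scores(proposal_heavy, ground_truth))
```

In this toy case the proposal-style detector has the lower F-score but finds every annotated word. That is the situation the third contribution refers to: a downstream classifier or retrieval system cannot use a word that was never detected.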
