[Figure: Overview of the proposed pipeline. Panels: Original Image; Background Reconstruction; Background Filtering (Text Saliency); Binarizing Text Saliency; Character Detection; Character Recognition; Textual Cue Encoding; Visual Cue Encoding; Classifier (three instances); Fine-grained classification; Logo retrieval.]

This work focuses on fine-grained object classification using recognized scene text in natural images. While the state of the art relies on visual cues only, this paper is the first to propose combining textual and visual cues. Another novelty is the textual cue extraction: unlike state-of-the-art text detection methods, we focus on the background rather than on text regions. Once text regions are detected, they are processed by two methods to perform text recognition, i.e. the ABBYY commercial OCR engine and a state-of-the-art character recognition algorithm. Then, to perform textual cue encoding, bi- and trigrams are formed between the recognized characters using the proposed spatial pairwise constraints. Finally, the extracted visual and textual cues are combined for fine-grained classification. The proposed method is validated on four publicly available datasets: ICDAR03, ICDAR13, Con-Text and Flickr-logo. We improve the state of the art in end-to-end character recognition by a large margin of 15% on ICDAR03. We show that textual cues are useful in addition to visual cues for fine-grained classification, and that they are also useful for logo retrieval. Combining textual and visual cues outperforms the visual-only and textual-only baselines in fine-grained classification (70.7% vs. 60.3%) and logo retrieval (57.4% vs. 54.8%).
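The textual cue encoding step (bi- and trigrams formed between recognized characters under spatial pairwise constraints, then combined with visual cues) can be illustrated with a minimal sketch. The abstract does not define the constraints or the fusion scheme, so everything here is an assumption: each recognized character is assumed to carry a bounding-box centre and height, a pair is assumed valid when the second character lies to the right of the first within one character height with similar size and vertical alignment, and fusion is assumed to be a weighted concatenation. The names `Char`, `spatially_compatible`, `ngram_counts`, `encode` and `fuse`, and all thresholds, are hypothetical and not taken from the paper.

```python
from collections import Counter
from dataclasses import dataclass
from itertools import permutations
import math


@dataclass
class Char:
    """A recognized character with a hypothetical bounding-box summary."""
    label: str
    x: float  # horizontal centre of the character box (pixels)
    y: float  # vertical centre of the character box (pixels)
    h: float  # box height (pixels); assumed > 0


def spatially_compatible(a: Char, b: Char, max_gap: float = 1.0,
                         max_height_ratio: float = 1.5, max_dy: float = 0.5) -> bool:
    """Hypothetical pairwise constraint: b follows a on the same text line."""
    if a.h <= 0 or b.h <= 0:
        return False
    if b.x <= a.x:  # b must lie to the right of a (left-to-right reading order)
        return False
    if max(a.h, b.h) / min(a.h, b.h) > max_height_ratio:  # similar font size
        return False
    ref = max(a.h, b.h)
    if b.x - a.x > max_gap * ref:  # horizontally close
        return False
    return abs(b.y - a.y) <= max_dy * ref  # roughly on the same baseline


def ngram_counts(chars: list[Char]) -> Counter:
    """Count bigrams and trigrams whose consecutive members satisfy the constraint."""
    grams: Counter = Counter()
    for a, b in permutations(chars, 2):
        if spatially_compatible(a, b):
            grams[a.label + b.label] += 1
    for a, b, c in permutations(chars, 3):
        if spatially_compatible(a, b) and spatially_compatible(b, c):
            grams[a.label + b.label + c.label] += 1
    return grams


def encode(grams: Counter, vocab: list[str]) -> list[float]:
    """Map n-gram counts onto a fixed vocabulary and L2-normalize."""
    vec = [float(grams.get(g, 0)) for g in vocab]
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else vec


def fuse(visual: list[float], textual: list[float], w: float = 0.5) -> list[float]:
    """Assumed early fusion: weighted concatenation of the two cue vectors."""
    return [(1 - w) * v for v in visual] + [w * t for t in textual]


if __name__ == "__main__":
    word = [Char("C", 10, 50, 12), Char("A", 20, 50, 12),
            Char("F", 30, 50, 12), Char("E", 40, 50, 12),
            Char("X", 200, 50, 12)]  # stray detection far from the word
    grams = ngram_counts(word)
    print(sorted(grams))  # ['AF', 'AFE', 'CA', 'CAF', 'FE']
    vocab = ["CA", "AF", "FE", "CAF", "AFE", "XY"]
    print(fuse([1.0, 2.0], encode(grams, vocab)))
```

In this toy example the spatial constraint keeps only adjacent letters of "CAFE" as n-grams and discards the isolated "X", which is the intended effect of pairwise constraints: spurious character detections do not generate textual cues. Training a separate classifier per cue and combining their scores would be an equally plausible reading of "combined"; concatenation is used here only as the simplest option.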
