Textual Visual Semantic Dataset for Text Spotting

Text Spotting in the wild consists of detecting and recognizing text appearing in images (e.g., signboards, traffic signals, or brands on clothing or objects). This is a challenging problem due to the complexity of the contexts in which text appears (uneven backgrounds, shading, occlusions, perspective distortions, etc.). Only a few approaches try to exploit the relation between text and its surrounding environment to better recognize text in the scene. In this paper, we propose a visual context dataset for Text Spotting in the wild, in which the publicly available COCO-text dataset [4] has been extended with information about the scene (such as objects and places appearing in the image) to enable researchers to include semantic relations between text and scene in their Text Spotting systems, and to offer a common framework for such approaches. For each text instance in an image, we extract three kinds of context information: the objects in the scene, the image location (place) label, and a textual image description (caption). We use state-of-the-art, out-of-the-box tools to extract this additional information. Since this information has textual form, it can be used to leverage text similarity or semantic relatedness methods in Text Spotting systems, either as a post-processing step or in an end-to-end training strategy.
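As a rough illustration of how such textual context could be used as a post-processing step, the sketch below re-ranks the k-best hypotheses of a text recognizer by combining the recognizer's confidence with the semantic similarity between each candidate word and the image's context labels (objects and place). This is a minimal sketch, not the dataset's actual pipeline: the toy word vectors, the `rerank` helper, and the linear scoring weight `alpha` are assumptions made for this example; a real system would use pre-trained embeddings such as GloVe [37] or fastText [12] and the annotations shipped with the dataset.

```python
import math

# Toy word vectors for illustration only; in practice these would come from
# pre-trained embeddings such as GloVe or fastText.
TOY_VECTORS = {
    "pizza":      [0.90, 0.10, 0.00],
    "pizaa":      [0.20, 0.70, 0.30],   # an OCR error, far from the food-related region
    "restaurant": [0.80, 0.30, 0.10],
    "oven":       [0.70, 0.20, 0.20],
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def rerank(candidates, context_words, alpha=0.5):
    """Combine the recognizer's confidence with the maximum semantic
    similarity between each candidate word and the visual context labels.
    `alpha` is an illustrative mixing weight, not a value from the paper."""
    rescored = []
    for word, ocr_score in candidates:
        sims = [cosine(TOY_VECTORS[word], TOY_VECTORS[c])
                for c in context_words
                if word in TOY_VECTORS and c in TOY_VECTORS]
        sem_score = max(sims) if sims else 0.0
        rescored.append((word, alpha * ocr_score + (1 - alpha) * sem_score))
    return sorted(rescored, key=lambda x: x[1], reverse=True)

# Visual context extracted for the image: detected objects and place label.
context = ["restaurant", "oven"]
# k-best hypotheses from a text recognizer with their confidence scores.
hypotheses = [("pizaa", 0.55), ("pizza", 0.45)]
print(rerank(hypotheses, context))  # the context pushes "pizza" to the top
```

In this toy run, the recognizer initially prefers the misspelled hypothesis, but the higher similarity of "pizza" to the scene labels flips the ranking, which is the kind of post-processing the dataset is meant to support.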

[1] Peter Young, et al. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions, 2014, TACL.

[2] Dwi H. Widyantoro, et al. Levensthein distance as a post-process to improve the performance of OCR in written road signs, 2017, 2017 Second International Conference on Informatics and Computing (ICIC).

[3] Pietro Perona, et al. Microsoft COCO: Common Objects in Context, 2014, ECCV.

[4] Jiri Matas, et al. COCO-Text: Dataset and Benchmark for Text Detection and Recognition in Natural Images, 2016, ArXiv.

[5] Daniel N. Osherson, et al. Probability from similarity, 2003.

[6] Andrew Zisserman, et al. Reading Text in the Wild with Convolutional Neural Networks, 2014, International Journal of Computer Vision.

[7] Dumitru Erhan, et al. Going deeper with convolutions, 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[8] Seiichi Uchida, et al. Could scene context be beneficial for scene text detection?, 2016, Pattern Recognit.

[9] Jon Almazán, et al. ICDAR 2013 Robust Reading Competition, 2013, 2013 12th International Conference on Document Analysis and Recognition.

[10] Jian Sun, et al. Deep Residual Learning for Image Recognition, 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[11] Simone Paolo Ponzetto, et al. BabelNet: The automatic construction, evaluation and application of a wide-coverage multilingual semantic network, 2012, Artif. Intell.

[12] Tomas Mikolov, et al. Bag of Tricks for Efficient Text Classification, 2016, EACL.

[13] Radu Soricut, et al. Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning, 2018, ACL.

[14] Fei-Fei Li, et al. Deep visual-semantic alignments for generating image descriptions, 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[15] Ernest Valveny, et al. Visual Attention Models for Scene Text Recognition, 2017, 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR).

[16] Nigel Collier, et al. De-Conflated Semantic Representations, 2016, EMNLP.

[17] Jürgen Schmidhuber, et al. Long Short-Term Memory, 1997, Neural Computation.

[18] Shitala Prasad, et al. Using Object Information for Spotting Text, 2018, ECCV.

[19] Francesc Moreno-Noguer, et al. Semantic Relatedness Based Re-ranker for Text Spotting, 2019, EMNLP/IJCNLP.

[20] Jeffrey Dean, et al. Distributed Representations of Words and Phrases and their Compositionality, 2013, NIPS.

[21] Ignacio Iacobacci, et al. SensEmbed: Learning Sense Embeddings for Word and Relational Similarity, 2015, ACL.

[22] Ignacio Iacobacci, et al. LSTMEmbed: Learning Word and Sense Representations from a Large Semantically Annotated Corpus with Long Short-Term Memories, 2019, ACL.

[23] Jörg Tiedemann, et al. OpenSubtitles2016: Extracting Large Parallel Corpora from Movie and TV Subtitles, 2016, LREC.

[24] Ming-Wei Chang, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019, NAACL.

[25] Suk I. Yoo, et al. Detection and Recognition of Text Embedded in Online Images via Neural Context Models, 2017, AAAI.

[26] Hartmut Neven, et al. PhotoOCR: Reading Text in Uncontrolled Conditions, 2013, 2013 IEEE International Conference on Computer Vision.

[27] Samy Bengio, et al. Show and tell: A neural image caption generator, 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[28] Ignacio Iacobacci, et al. Embedding Words and Senses Together via Joint Knowledge-Enhanced Training, 2016, CoNLL.

[29] Steven Schockaert, et al. Relational Word Embeddings, 2019, ACL.

[30] Nan Hua, et al. Universal Sentence Encoder, 2018, ArXiv.

[31] Francesc Moreno-Noguer, et al. Visual Re-ranking with Natural Language Understanding for Text Spotting, 2018, ACCV.

[32] Lukasz Kaiser, et al. Attention is All you Need, 2017, NIPS.

[33] Christos Liambas, et al. Autonomous OCR dictating system for blind people, 2016, 2016 IEEE Global Humanitarian Technology Conference (GHTC).

[34] Yongdong Zhang, et al. Attention and Language Ensemble for Scene Text Recognition with Convolutional Sequence Modeling, 2018, ACM Multimedia.

[35] Xiang Bai, et al. An End-to-End Trainable Neural Network for Image-Based Sequence Recognition and Its Application to Scene Text Recognition, 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[36] Roberto Navigli, et al. Nasari: Integrating explicit knowledge and corpus statistics for a multilingual representation of concepts and entities, 2016, Artif. Intell.

[37] Jeffrey Pennington, et al. GloVe: Global Vectors for Word Representation, 2014, EMNLP.

[38] Bolei Zhou, et al. Places: A 10 Million Image Database for Scene Recognition, 2018, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[39] Kai Wang, et al. Word Spotting in the Wild, 2010, ECCV.

[40] Jürgen Schmidhuber, et al. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks, 2006, ICML.

[41] Jiri Matas, et al. ICDAR2017 Robust Reading Challenge on COCO-Text, 2017, 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR).

[42] Yoshua Bengio, et al. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention, 2015, ICML.

[43] Roberto Navigli, et al. Automatic Construction and Evaluation of a Large Semantically Enriched Wikipedia, 2016, IJCAI.

[44] Andrew Zisserman, et al. Synthetic Data and Artificial Neural Networks for Natural Scene Text Recognition, 2014, ArXiv.

[45] Sergey Ioffe, et al. Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning, 2016, AAAI.

[46] Francesc Moreno-Noguer, et al. BreakingNews: Article Annotation by Image and Text Processing, 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.