Utilizing Resource-Rich Language Datasets for End-to-End Scene Text Recognition in Resource-Poor Languages

This paper presents a novel training method for end-to-end scene text recognition. End-to-end scene text recognition achieves high recognition accuracy, especially with Transformer-based encoder-decoder models. Training a highly accurate end-to-end model requires a large image-to-text paired dataset for the target language, but such data is difficult to collect, especially for resource-poor languages. To overcome this difficulty, the proposed method uses well-prepared large datasets in resource-rich languages, such as English, to train the encoder-decoder model for the resource-poor language. The key idea is to build a model in which the encoder reflects knowledge of multiple languages while the decoder specializes in the resource-poor language alone. To this end, the proposed method pre-trains the encoder on a multilingual dataset that combines the resource-poor and resource-rich languages' datasets, so that the encoder learns language-invariant knowledge for scene text recognition. It also pre-trains the decoder on the resource-poor language's dataset so that the decoder is better suited to that language. Experiments on Japanese scene text recognition using a small, publicly available dataset demonstrate the effectiveness of the proposed method.
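
The data flow described in the abstract can be illustrated with a minimal Python sketch. It only shows which dataset feeds which pre-training step. The names `Sample`, `Stage`, and `build_schedule` are hypothetical, and both the `trainable` markings and the final `finetune` stage are assumptions that the abstract does not state; this is not the authors' implementation.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

# One training example: (path to a cropped text image, its transcription).
Sample = Tuple[str, str]


@dataclass(frozen=True)
class Stage:
    name: str
    data: List[Sample]          # examples used in this stage
    trainable: Tuple[str, ...]  # ASSUMPTION: modules updated in this stage


def build_schedule(rich: Sequence[Sample], poor: Sequence[Sample]) -> List[Stage]:
    """Return a staged training plan (hypothetical helper).

    From the abstract: the encoder is pre-trained on the combined
    resource-rich + resource-poor data, and the decoder is pre-trained
    on the resource-poor data only.
    """
    multilingual = list(rich) + list(poor)
    return [
        # Encoder sees every language so it can learn language-invariant
        # features. In practice some decoder is also needed to compute a
        # loss; that detail is omitted here.
        Stage("pretrain_encoder", multilingual, ("encoder",)),
        # Decoder sees only the resource-poor language.
        Stage("pretrain_decoder", list(poor), ("decoder",)),
        # ASSUMPTION: a joint fine-tuning pass on the target language.
        Stage("finetune", list(poor), ("encoder", "decoder")),
    ]


if __name__ == "__main__":
    english = [("en_0.png", "hello"), ("en_1.png", "world")]
    japanese = [("ja_0.png", "東京")]
    for stage in build_schedule(english, japanese):
        print(stage.name, len(stage.data), stage.trainable)
```

Keeping the schedule as plain data separates the question of which data each component sees from the choice of model and optimizer, which the abstract leaves open.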
