TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models

Text recognition is a long-standing research problem in document digitization. Existing approaches are usually built on a CNN for image understanding and an RNN for character-level text generation, and they often need a separate language model as a post-processing step to improve overall accuracy. In this paper, we propose TrOCR, an end-to-end text recognition approach with pre-trained image Transformer and text Transformer models, which leverages the Transformer architecture for both image understanding and wordpiece-level text generation. The TrOCR model is simple but effective, and can be pre-trained with large-scale synthetic data and fine-tuned with human-labeled datasets. Experiments show that the TrOCR model outperforms the current state-of-the-art models on both printed and handwritten text recognition tasks. The code and models will be publicly available at https://aka.ms/TrOCR.
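The encoder-decoder flow described above can be illustrated with a minimal, dependency-free sketch. It is not the TrOCR implementation. It assumes a Vision Transformer-style input, in which the image is cut into fixed-size patches, and a BERT-style "##" continuation marker for wordpieces; neither detail is stated in the abstract. All names (`image_to_patches`, `greedy_decode`, `join_wordpieces`, `toy_step`) are hypothetical, and `toy_step` is a scripted stand-in for a trained text Transformer decoder.

```python
from typing import Callable, List, Sequence

Image = List[List[float]]  # grayscale image stored as rows of pixel values


def image_to_patches(image: Image, patch: int) -> List[List[float]]:
    """Split an H x W image into flattened, non-overlapping patch x patch tiles.

    Assumption: the image Transformer consumes a sequence of such tiles.
    """
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image sides must be multiples of the patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            tile = [
                image[r][c]
                for r in range(top, top + patch)
                for c in range(left, left + patch)
            ]
            patches.append(tile)
    return patches


def greedy_decode(
    step: Callable[[Sequence[Sequence[float]], Sequence[str]], str],
    memory: Sequence[Sequence[float]],
    bos: str = "<s>",
    eos: str = "</s>",
    max_len: int = 16,
) -> List[str]:
    """Generate wordpieces one at a time until `eos` or `max_len` is reached.

    `step` stands in for the text decoder: given the encoder output and the
    tokens produced so far, it returns the next wordpiece.
    """
    tokens = [bos]
    for _ in range(max_len):
        next_token = step(memory, tokens)
        if next_token == eos:
            break
        tokens.append(next_token)
    return tokens[1:]  # drop the start token


def join_wordpieces(pieces: Sequence[str]) -> str:
    """Merge wordpieces into text; '##' marks a continuation (assumed convention)."""
    words: List[str] = []
    for piece in pieces:
        if piece.startswith("##") and words:
            words[-1] += piece[2:]
        else:
            words.append(piece)
    return " ".join(words)


def toy_step(memory: Sequence[Sequence[float]], prefix: Sequence[str]) -> str:
    """Scripted decoder used only for illustration; it ignores `memory`."""
    script = ["tr", "##ocr", "works", "</s>"]
    return script[len(prefix) - 1]
```

In this sketch, a 4 x 4 image with a patch size of 2 yields four flattened patches of four pixels each, and `join_wordpieces(greedy_decode(toy_step, patches))` merges the scripted pieces into whole words. Generating wordpieces rather than single characters shortens the output sequence the decoder has to produce.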
