Paper Information - TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models

Text recognition is a long-standing research problem for document digitalization. Existing approaches for text recognition are usually built on CNNs for image understanding and RNNs for char-level text generation. In addition, a separate language model is usually needed as a post-processing step to improve the overall accuracy. In this paper, we propose an end-to-end text recognition approach with pre-trained image Transformer and text Transformer models, namely TrOCR, which leverages the Transformer architecture for both image understanding and wordpiece-level text generation. The TrOCR model is simple but effective, and can be pre-trained with large-scale synthetic data and fine-tuned with human-labeled datasets. Experiments show that the TrOCR model outperforms the current state-of-the-art models on both printed and handwritten text recognition tasks. The code and models will be publicly available at https://aka.ms/TrOCR.
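
The abstract contrasts char-level generation with wordpiece-level generation. As a rough illustration of that difference, the sketch below splits a word with the greedy longest-match-first rule used by WordPiece tokenizers. The function name `wordpiece_tokenize`, the `##` continuation prefix, and the tiny `TOY_VOCAB` are assumptions made for this example; the abstract does not say which tokenizer or vocabulary TrOCR actually uses.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Split one word into subword units by greedy longest-match-first.

    Hypothetical helper for illustration only, not TrOCR's tokenizer.
    """
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                # Non-initial pieces carry a continuation prefix.
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece in the vocabulary covers this position.
            return [unk]
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary, invented for this example.
TOY_VOCAB = {"recogn", "##ition", "hand", "##writ", "##ing", "text"}

char_units = list("recognition")                          # char-level units
piece_units = wordpiece_tokenize("recognition", TOY_VOCAB)  # wordpiece units

print(len(char_units), char_units)
print(len(piece_units), piece_units)
```

With this toy vocabulary the decoder would emit 2 units for "recognition" instead of 11 characters, which is the kind of shorter output sequence that wordpiece-level generation gives a text decoder.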