LayoutLM: Pre-training of Text and Layout for Document Image Understanding

Pre-training techniques have been verified successfully in a variety of NLP tasks in recent years. Despite the widespread use of pre-training models for NLP applications, they almost exclusively focus on text-level manipulation, while neglecting layout and style information that is vital for document image understanding. In this paper, we propose the LayoutLM to jointly model interactions between text and layout information across scanned document images, which is beneficial for a great number of real-world document image understanding tasks such as information extraction from scanned documents. Furthermore, we also leverage image features to incorporate words' visual information into LayoutLM. To the best of our knowledge, this is the first time that text and layout are jointly learned in a single framework for document-level pre-training. It achieves new state-of-the-art results in several downstream tasks, including form understanding (from 70.72 to 79.27), receipt understanding (from 94.02 to 95.24) and document image classification (from 93.07 to 94.42). The code and pre-trained LayoutLM models are publicly available at https://aka.ms/layoutlm.

[1]  Marcus Liwicki,et al.  Cutting the Error by Half: Investigation of Very Deep CNN and Advanced Training Strategies for Document Image Classification , 2017, 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR).

[2]  Matheus Palhares Viana,et al.  Fast CNN-Based Document Layout Analysis , 2017, 2017 IEEE International Conference on Computer Vision Workshops (ICCVW).

[3]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[4]  Giovanni Soda,et al.  Artificial neural networks for document analysis and recognition , 2005, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[5]  Ming-Wei Chang,et al.  BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding , 2019, NAACL.

[6]  Shinjae Yoo,et al.  Visual Detection with Context for Document Layout Analysis , 2019, EMNLP.

[7]  Robert M. Haralick,et al.  Document page decomposition by the bounding-box project , 1995, Proceedings of 3rd International Conference on Document Analysis and Recognition.

[8]  H. Emptoz,et al.  A fast and efficient method for extracting text paragraphs and graphics from unconstrained documents , 1992, Proceedings., 11th IAPR International Conference on Pattern Recognition. Vol.II. Conference B: Pattern Recognition Methodology and Systems.

[9]  Ujjwal Bhattacharya,et al.  Document Image Classification with Intra-Domain Transfer Learning and Stacked Generalization of Deep Convolutional Neural Networks , 2018, 2018 24th International Conference on Pattern Recognition (ICPR).

[10]  Sergey Ioffe,et al.  Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning , 2016, AAAI.

[11]  Arnab Nandi,et al.  Deterministic Routing between Layout Abstractions for Multi-Scale Classification of Visually Rich Documents , 2019, IJCAI.

[12]  A. Peter Johnson,et al.  A Fast Algorithm for Bottom-Up Document Layout Analysis , 1997, IEEE Trans. Pattern Anal. Mach. Intell..

[13]  Shlomo Argamon,et al.  Building a test collection for complex document information processing , 2006, SIGIR.

[14]  Omer Levy,et al.  RoBERTa: A Robustly Optimized BERT Pretraining Approach , 2019, ArXiv.

[15]  Lawrence O'Gorman,et al.  The Document Spectrum for Page Layout Analysis , 1993, IEEE Trans. Pattern Anal. Mach. Intell..

[16]  Steffen Bickel,et al.  Chargrid: Towards Understanding 2D Documents , 2018, EMNLP.

[17]  Andreas Dengel,et al.  DeepDeSRT: Deep Learning for Detection and Structure Recognition of Tables in Document Images , 2017, 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR).

[18]  Rolf Ingold,et al.  Evaluation of SVM, MLP and GMM Classifiers for Layout Analysis of Historical Documents , 2013, 2013 12th International Conference on Document Analysis and Recognition.

[19]  Mohammad Mehdi Rashidi,et al.  Modular Multimodal Architecture for Document Classification , 2019, ArXiv.

[20]  Antonio Jimeno-Yepes,et al.  PubLayNet: Largest Dataset Ever for Document Layout Analysis , 2019, 2019 International Conference on Document Analysis and Recognition (ICDAR).

[21]  Konstantinos G. Derpanis,et al.  Evaluation of deep convolutional nets for document image classification and retrieval , 2015, 2015 13th International Conference on Document Analysis and Recognition (ICDAR).

[22]  Robert M. Haralick,et al.  Recursive X-Y cut using bounding boxes of connected components , 1995, Proceedings of 3rd International Conference on Document Analysis and Recognition.

[23]  Ersin Yumer,et al.  Learning to Extract Semantic Structure from Documents Using Multimodal Fully Convolutional Neural Networks , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[24]  Paul A. Viola,et al.  Learning nongenerative grammatical models for document analysis , 2005, Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1.

[25]  Michael S. Bernstein,et al.  Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations , 2016, International Journal of Computer Vision.

[26]  Kaiming He,et al.  Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[27]  Zhi Tang,et al.  A Table Detection Method for PDF Documents Based on Convolutional Neural Networks , 2016, 2016 12th IAPR Workshop on Document Analysis Systems (DAS).

[28]  Xiaojing Liu,et al.  Graph Convolution for Multimodal Information Extraction from Visually Rich Documents , 2019, NAACL.

[29]  Jean-Philippe Thiran,et al.  FUNSD: A Dataset for Form Understanding in Noisy Scanned Documents , 2019, 2019 International Conference on Document Analysis and Recognition Workshops (ICDARW).

[30]  Ross B. Girshick,et al.  Mask R-CNN , 2017, 1703.06870.