StructuralLM: Structural Pre-training for Form Understanding

Large pre-trained language models achieve state-of-the-art results when fine-tuned on downstream NLP tasks. However, they focus almost exclusively on text-only representations and neglect the cell-level layout information that is important for form image understanding. In this paper, we propose a new pre-training approach, StructuralLM, to jointly leverage cell and layout information from scanned documents. Specifically, we pre-train StructuralLM with two new designs that exploit the interactions between cell and layout information: 1) treating each cell as a semantic unit; 2) classifying cell positions. The pre-trained StructuralLM achieves new state-of-the-art results on several types of downstream tasks, including form understanding (from 78.95 to 85.14), document visual question answering (from 72.59 to 83.94), and document image classification (from 94.43 to 96.08).
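The two designs above can be illustrated with a minimal sketch. It assumes, purely for illustration, that cell bounding boxes are normalized to a 0-999 coordinate grid and that the page is split into an n x n grid of areas for the cell-position classification target; the names and grid size are hypothetical, not the paper's exact implementation.

```python
# Sketch of the two StructuralLM pre-training designs described above.
# Assumptions (not stated in the abstract): bboxes are on a 0-999 grid,
# and the page is divided into n*n areas for position classification.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Cell:
    tokens: List[str]                     # words grouped into one semantic unit
    bbox: Tuple[int, int, int, int]       # (x0, y0, x1, y1), normalized to 0-999

def shared_cell_layout(cell: Cell) -> List[Tuple[str, Tuple[int, int, int, int]]]:
    """Design 1: every token in a cell shares the cell's 2D position,
    so the model treats the cell, not the word, as the layout unit."""
    return [(tok, cell.bbox) for tok in cell.tokens]

def cell_position_label(cell: Cell, n: int = 4) -> int:
    """Design 2: a classification target giving which of the n*n page
    areas the cell's center falls in."""
    cx = (cell.bbox[0] + cell.bbox[2]) // 2
    cy = (cell.bbox[1] + cell.bbox[3]) // 2
    col = min(cx * n // 1000, n - 1)
    row = min(cy * n // 1000, n - 1)
    return row * n + col

cell = Cell(tokens=["Date:", "2020-03-01"], bbox=(100, 50, 400, 80))
print(shared_cell_layout(cell))   # both tokens carry the same cell bbox
print(cell_position_label(cell))  # area index in [0, n*n)
```

During pre-training the model would be asked to predict `cell_position_label` for (shuffled) cells, which forces it to relate a cell's content to its spatial location on the page.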
