MarkupLM: Pre-training of Text and Markup Language for Visually-rich Document Understanding

Multimodal pre-training with text, layout, and image has made significant progress for Visually-rich Document Understanding (VrDU), especially for fixed-layout documents such as scanned document images. However, a large number of digital documents have layouts that are not fixed and must be rendered interactively and dynamically for visualization, making existing layout-based pre-training approaches difficult to apply. In this paper, we propose MarkupLM for document understanding tasks whose backbone is a markup language, such as HTML/XML-based documents, where text and markup information are jointly pre-trained. Experimental results show that the pre-trained MarkupLM significantly outperforms existing strong baseline models on several document understanding tasks. The pre-trained model and code will be publicly available at https://aka.ms/markuplm.
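To make the idea of joint text/markup input concrete, below is a minimal sketch (not the authors' released code) of how text and markup signals might be paired for an HTML document, assuming each text node is represented together with the XPath locating it in the DOM tree; the function name and use of lxml are illustrative choices, not part of the paper.

```python
# Illustrative sketch: extract (text, xpath) pairs from an HTML string,
# one plausible way to represent markup structure alongside the raw text.
# Requires lxml (pip install lxml).
from lxml import html


def extract_text_xpath_pairs(html_string):
    """Return (text, xpath) pairs for every non-empty element text node."""
    root = html.fromstring(html_string)
    tree = root.getroottree()  # needed to compute absolute XPaths
    pairs = []
    for element in root.iter():
        if element.text and element.text.strip():
            pairs.append((element.text.strip(), tree.getpath(element)))
    return pairs


if __name__ == "__main__":
    doc = "<html><body><h1>Title</h1><div><p>Some text.</p></div></body></html>"
    for text, xpath in extract_text_xpath_pairs(doc):
        print(xpath, "->", text)
    # Expected output:
    # /html/body/h1 -> Title
    # /html/body/div/p -> Some text.
```

Pairs like these could then be tokenized, with the XPath serving as a structural feature attached to each token, in place of the 2-D coordinates used by layout-based models for fixed-layout documents.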
