Paper Information - MarkupLM: Pre-training of Text and Markup Language for Visually-rich Document Understanding

Multimodal pre-training with text, layout, and image has made significant progress for Visually-rich Document Understanding (VrDU), especially for fixed-layout documents such as scanned document images. However, there are still a large number of digital documents where the layout information is not fixed and needs to be interactively and dynamically rendered for visualization, which makes existing layout-based pre-training approaches difficult to apply. In this paper, we propose MarkupLM for document understanding tasks on documents that use markup languages as the backbone, such as HTML/XML-based documents, where text and markup information is jointly pre-trained. Experimental results show that the pre-trained MarkupLM significantly outperforms the existing strong baseline models on several document understanding tasks. The pre-trained model and code will be publicly available at https://aka.ms/markuplm.

Furu Wei | Lei Cui | Junlong Li | Yiheng Xu
