Paper information - Data-Efficient Information Extraction from Documents with Pre-trained Language Models

As with many text understanding and generation tasks, pre-trained language models have emerged as a powerful approach for extracting information from business documents. However, their performance has not been properly studied in the data-constrained settings often encountered in industrial applications. In this paper, we show that LayoutLM, a pre-trained model recently proposed for encoding 2D documents, exhibits high sample efficiency when fine-tuned on public and real-world Information Extraction (IE) datasets. Indeed, LayoutLM reaches more than 80% of its full performance with as few as 32 documents for fine-tuning. Compared with a strong baseline that learns IE from scratch, the pre-trained model needs between 4 and 30 times fewer annotated documents in the toughest data conditions. Finally, LayoutLM performs better on the real-world dataset when it has first been fine-tuned on the full public dataset, which indicates valuable knowledge transfer abilities. We therefore advocate the use of pre-trained language models for tackling practical extraction problems.
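The two quantities quoted in the abstract (the share of full performance reached with few documents, and the factor by which annotation needs shrink) can be read off a pair of learning curves. The sketch below is only an illustration of that arithmetic, not the authors' evaluation code: the function names are hypothetical, and the curve values are made-up placeholders rather than results from the paper.

```python
from typing import Dict, Optional


def relative_performance(score: float, full_score: float) -> float:
    """Fraction of the full-data score reached by a low-data run."""
    return score / full_score


def docs_needed(curve: Dict[int, float], target: float) -> Optional[int]:
    """Smallest training-set size in `curve` whose score reaches `target`."""
    reaching = [n_docs for n_docs, score in curve.items() if score >= target]
    return min(reaching) if reaching else None


def reduction_factor(
    pretrained: Dict[int, float], scratch: Dict[int, float], target: float
) -> Optional[float]:
    """How many times fewer documents the pre-trained model needs for `target`."""
    n_pretrained = docs_needed(pretrained, target)
    n_scratch = docs_needed(scratch, target)
    if n_pretrained is None or n_scratch is None:
        return None  # at least one model never reaches the target
    return n_scratch / n_pretrained


# Placeholder learning curves: {number of training documents: score}.
# These numbers are invented for illustration and are NOT from the paper.
pretrained_curve = {8: 0.60, 32: 0.80, 128: 0.90, 512: 0.95}
scratch_curve = {8: 0.10, 32: 0.40, 128: 0.70, 512: 0.85}

rel = relative_performance(pretrained_curve[32], pretrained_curve[512])
factor = reduction_factor(pretrained_curve, scratch_curve, target=0.80)

print(f"share of full performance at 32 documents: {rel:.0%}")
print(f"reduction in annotated documents at target 0.80: {factor}x")
```

With real learning curves, the target score and the grid of training-set sizes determine the factor, so a range such as "4 to 30 times" would come from repeating this comparison across several targets and datasets.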
