Data-Efficient Information Extraction from Documents with Pre-trained Language Models

As with many text understanding and generation tasks, pre-trained language models have emerged as a powerful approach for extracting information from business documents. However, their performance has not been properly studied in the data-constrained settings that are often encountered in industrial applications. In this paper, we show that LayoutLM, a pre-trained model recently proposed for encoding 2D documents, exhibits high sample efficiency when fine-tuned on public and real-world Information Extraction (IE) datasets. Indeed, LayoutLM reaches more than 80% of its full performance with as few as 32 documents for fine-tuning. Compared with a strong baseline that learns IE from scratch, the pre-trained model needs between 4 and 30 times fewer annotated documents in the toughest data conditions. Finally, LayoutLM performs better on the real-world dataset when it has first been fine-tuned on the full public dataset, which indicates valuable knowledge transfer abilities. We therefore advocate the use of pre-trained language models for tackling practical extraction problems.
