Unified Pretraining Framework for Document Understanding

Document intelligence automates the extraction of information from documents and supports many business applications. Recent self-supervised learning methods on large-scale unlabeled document datasets have opened up promising directions towards reducing annotation efforts by training models with self-supervised objectives. However, most of the existing document pretraining methods are still language-dominated. We present UniDoc, a new unified pretraining framework for document understanding. UniDoc is designed to support most document understanding tasks, extending the Transformer to take multimodal embeddings as input. Each input element is composed of words and visual features from a semantic region of the input document image. An important feature of UniDoc is that it learns a generic representation by making use of three self-supervised losses, encourag-ing the representation to model sentences, learn similarities, and align modalities. Extensive empirical analysis demonstrates that the pretraining procedure learns better joint representations and leads to improvements in downstream tasks.

[1]  Hongfu Liu,et al.  SelfDoc: Self-Supervised Document Representation Learning , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[2]  Tomasz Dwojak,et al.  Going Full-TILT Boogie on Document Understanding with Text-Image-Layout Transformer , 2021, ICDAR.

[3]  Cha Zhang,et al.  LayoutLMv2: Multi-modal Pre-training for Visually-rich Document Understanding , 2020, ACL.

[4]  Abdel-rahman Mohamed,et al.  wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations , 2020, NeurIPS.

[5]  Furu Wei,et al.  LayoutLM: Pre-training of Text and Layout for Document Image Understanding , 2019, KDD.

[6]  Seunghyun Park,et al.  CORD: A Consolidated Receipt Dataset for Post-OCR Parsing , 2019 .

[7]  Omri Ben-Eliezer,et al.  READ: Recursive Autoencoders for Document Layout Generation , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[8]  Furu Wei,et al.  VL-BERT: Pre-training of Generic Visual-Linguistic Representations , 2019, ICLR.

[9]  Nan Duan,et al.  Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training , 2019, AAAI.

[10]  Antonio Jimeno-Yepes,et al.  PubLayNet: Largest Dataset Ever for Document Layout Analysis , 2019, 2019 International Conference on Document Analysis and Recognition (ICDAR).

[11]  Iryna Gurevych,et al.  Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks , 2019, EMNLP.

[12]  Stefan Lee,et al.  ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks , 2019, NeurIPS.

[13]  Greg Mori,et al.  Similarity-Preserving Knowledge Distillation , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[14]  Jean-Philippe Thiran,et al.  FUNSD: A Dataset for Form Understanding in Noisy Scanned Documents , 2019, 2019 International Conference on Document Analysis and Recognition Workshops (ICDARW).

[15]  Yiming Yang,et al.  Transformer-XL: Attentive Language Models beyond a Fixed-Length Context , 2019, ACL.

[16]  Oriol Vinyals,et al.  Neural Discrete Representation Learning , 2017, NIPS.

[17]  Enhua Wu,et al.  Squeeze-and-Excitation Networks , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[18]  Eneko Agirre,et al.  SemEval-2017 Task 1: Semantic Textual Similarity Multilingual and Crosslingual Focused Evaluation , 2017, *SEMEVAL.

[19]  Samuel R. Bowman,et al.  A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference , 2017, NAACL.

[20]  Ross B. Girshick,et al.  Mask R-CNN , 2017, 1703.06870.

[21]  Ben Poole,et al.  Categorical Reparameterization with Gumbel-Softmax , 2016, ICLR.

[22]  Diyi Yang,et al.  Hierarchical Attention Networks for Document Classification , 2016, NAACL.

[23]  Ting Liu,et al.  Recent advances in convolutional neural networks , 2015, Pattern Recognit..

[24]  Ross B. Girshick,et al.  Fast R-CNN , 2015, 1504.08083.

[25]  Konstantinos G. Derpanis,et al.  Evaluation of deep convolutional nets for document image classification and retrieval , 2015, 2015 13th International Conference on Document Analysis and Recognition (ICDAR).

[26]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[27]  Fei-Fei Li,et al.  ImageNet: A large-scale hierarchical image database , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[28]  Anthony W. Kay,et al.  Tesseract: an open-source optical character recognition engine , 2007 .

[29]  Shlomo Argamon,et al.  Building a test collection for complex document information processing , 2006, SIGIR.

[30]  Jaime Carbonell,et al.  Multi-Document Summarization By Sentence Extraction , 2000 .

[31]  Shafiq R. Joty,et al.  Self-Supervised Relationship Probing , 2020, NeurIPS.

[32]  Cordelia Schmid,et al.  Product Quantization for Nearest Neighbor Search , 2011, IEEE Transactions on Pattern Analysis and Machine Intelligence.