DocFormerv2: Local Features for Document Understanding

We propose DocFormerv2, a multi-modal transformer for Visual Document Understanding (VDU). The VDU domain entails understanding documents beyond mere OCR predictions, e.g., extracting information from a form, VQA for documents, and other tasks. VDU is challenging because a model must make sense of multiple modalities (visual, language, and spatial) to make a prediction. Our approach, termed DocFormerv2, is an encoder-decoder transformer that takes vision, language, and spatial features as input. DocFormerv2 is pre-trained with unsupervised tasks employed asymmetrically, i.e., two novel document tasks on the encoder and one on the auto-regressive decoder. The unsupervised tasks are carefully designed so that pre-training encourages local-feature alignment between the modalities. Evaluated on nine datasets, DocFormerv2 achieves state-of-the-art performance over strong baselines, e.g., TabFact (4.3%), InfoVQA (1.4%), and FUNSD (1%). Furthermore, to demonstrate generalization, on three VQA tasks involving scene text, DocFormerv2 outperforms previous comparably-sized models and even surpasses much larger models (such as GIT2, PaLI, and Flamingo) on some tasks. Extensive ablations show that, owing to its pre-training, DocFormerv2 understands multiple modalities better than prior art in VDU.
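To make the architecture description concrete, below is a minimal PyTorch sketch of the kind of multi-modal encoder-decoder the abstract outlines: OCR token embeddings, 2D bounding-box (spatial) embeddings, and visual features are fused and fed to a transformer encoder, while an auto-regressive decoder produces output text. Every name, dimension, and the additive fusion of layout embeddings here are illustrative assumptions, not DocFormerv2's actual implementation; the paper's novel pre-training tasks are likewise not shown.

```python
import torch
import torch.nn as nn

class MultiModalDocEncoderDecoder(nn.Module):
    """Toy encoder-decoder fusing language, spatial, and visual inputs.

    This is a sketch under assumed sizes and fusion choices, not the
    authors' model.
    """

    def __init__(self, vocab_size=32128, d_model=256, n_heads=4, n_layers=2):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        # One embedding table per bounding-box coordinate (x0, y0, x1, y1),
        # quantized to a 0..999 grid -- a common way to encode 2D layout.
        self.bbox_embs = nn.ModuleList(
            [nn.Embedding(1000, d_model) for _ in range(4)])
        # Stand-in visual branch: project pre-extracted patch features
        # (e.g., from a CNN or ViT backbone) into the shared model space.
        self.visual_proj = nn.Linear(768, d_model)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=n_heads,
            num_encoder_layers=n_layers, num_decoder_layers=n_layers,
            batch_first=True)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, ocr_ids, ocr_boxes, patch_feats, target_ids):
        # Fuse the language and spatial modalities by addition.
        x = self.token_emb(ocr_ids)
        for i, emb in enumerate(self.bbox_embs):
            x = x + emb(ocr_boxes[..., i])
        # Prepend projected visual tokens to the text+layout sequence.
        enc_in = torch.cat([self.visual_proj(patch_feats), x], dim=1)
        # Causal mask keeps the decoder auto-regressive.
        tgt = self.token_emb(target_ids)
        causal = self.transformer.generate_square_subsequent_mask(tgt.size(1))
        out = self.transformer(enc_in, tgt, tgt_mask=causal)
        return self.lm_head(out)  # (batch, target_len, vocab) logits

# Toy forward pass: 12 OCR tokens with quantized boxes, 16 visual patches.
model = MultiModalDocEncoderDecoder()
logits = model(
    ocr_ids=torch.randint(0, 32128, (1, 12)),
    ocr_boxes=torch.randint(0, 1000, (1, 12, 4)),
    patch_feats=torch.randn(1, 16, 768),
    target_ids=torch.randint(0, 32128, (1, 8)),
)
print(logits.shape)  # torch.Size([1, 8, 32128])
```

Additive fusion of quantized bounding-box embeddings follows common practice in layout-aware models such as LayoutLM [45]; where the real model differs (e.g., in how visual features are extracted or how the pre-training tasks supervise local-feature alignment), the paper itself is the authority.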

[1] Mu Li, et al. MixGen: A New Multi-Modal Data Augmentation, 2022, 2023 IEEE/CVF Winter Conference on Applications of Computer Vision Workshops (WACVW).

[2] Feiqi Cao, et al. SceneGATE: Scene-Graph based co-Attention networks for TExt visual question answering, 2022, ArXiv.

[3] Mohit Bansal, et al. Unifying Vision, Text, and Layout for Universal Document Processing, 2022, 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[4] Wei Wei, et al. A Benchmark for Structured Extractions from Complex Documents, 2022, ArXiv.

[5] N. Vasconcelos, et al. YORO - Lightweight End to End Visual Grounding, 2022, ECCV Workshops.

[6] Hua Wu, et al. ERNIE-Layout: Layout Knowledge Enhanced Pre-training for Visually-rich Document Understanding, 2022, EMNLP.

[7] Julian Martin Eisenschlos, et al. Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding, 2022, ArXiv.

[8] Furu Wei, et al. XDoc: Unified Pre-training for Cross-Format Document Understanding, 2022, EMNLP.

[9] Ashish V. Thapliyal, et al. PaLI: A Jointly-Scaled Multilingual Language-Image Model, 2022, ICLR.

[10] Radu Soricut, et al. PreSTU: Pre-Training for Scene-Text Understanding, 2022, ArXiv.

[11] Ramprasaath R. Selvaraju, et al. TAG: Boosting Text-VQA via Text-aware Visual Question-answer Generation, 2022, BMVC.

[12] Zhe Gan, et al. GIT: A Generative Image-to-text Transformer for Vision and Language, 2022, Trans. Mach. Learn. Res.

[13] Oriol Vinyals, et al. Flamingo: a Visual Language Model for Few-Shot Learning, 2022, NeurIPS.

[14] Vlad I. Morariu, et al. Unified Pretraining Framework for Document Understanding, 2022, ArXiv.

[15] Furu Wei, et al. LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking, 2022, ACM Multimedia.

[16] Vlad I. Morariu, et al. End-to-end Document Recognition and Understanding with Dessurt, 2022, ECCV Workshops.

[17] Yu Zhou, et al. Towards Escaping from Language Bias and OCR Error: Semantics-Centered Text Visual Question Answering, 2022, 2022 IEEE International Conference on Multimedia and Expo (ICME).

[18] Nan Hua, et al. FormNet: Structural Encoding beyond Sequential Modeling in Form Document Information Extraction, 2022, ACL.

[19] Liqing Zhang, et al. XYLayoutLM: Towards Layout-Aware Multimodal Networks For Visually-Rich Document Understanding, 2022, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[20] Lianwen Jin, et al. LiLT: A Simple yet Effective Language-Independent Layout Transformer for Structured Document Understanding, 2022, ACL.

[21] Ives Macêdo, et al. SeeTek: Very Large-Scale Open-set Logo Recognition with Text-Aware Metric Learning, 2022, 2022 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV).

[22] Srikar Appalaraju, et al. LaTr: Layout-Aware Transformer for Scene-Text VQA, 2021, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[23] Dongyoon Han, et al. OCR-Free Document Understanding Transformer, 2021, ECCV.

[24] Li Dong, et al. Swin Transformer V2: Scaling Up Capacity and Resolution, 2021, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[25] Furu Wei, et al. MarkupLM: Pre-training of Text and Markup Language for Visually Rich Document Understanding, 2021, ACL.

[26] Jean Oh, et al. Localize, Group, and Select: Boosting Text-VQA by Scene Text Modeling, 2021, 2021 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW).

[27] Errui Ding, et al. StrucTexT: Structured Text Understanding with Multi-Modal Transformers, 2021, ACM Multimedia.

[28] Bhargava Urala Kota, et al. DocFormer: End-to-End Transformer for Document Understanding, 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[29] Hongfu Liu, et al. SelfDoc: Self-Supervised Document Representation Learning, 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[30] Ilya Sutskever, et al. Learning Transferable Visual Models From Natural Language Supervision, 2021, ICML.

[31] Tomasz Dwojak, et al. Going Full-TILT Boogie on Document Understanding with Text-Image-Layout Transformer, 2021, ICDAR.

[32] Wonjae Kim, et al. ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision, 2021, ICML.

[33] Cha Zhang, et al. LayoutLMv2: Multi-modal Pre-training for Visually-rich Document Understanding, 2020, ACL.

[34] Jiebo Luo, et al. TAP: Text-Aware Pre-training for Text-VQA and Text-Caption, 2020, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[35] S. Gelly, et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, 2020, ICLR.

[36] C. V. Jawahar, et al. DocVQA: A Dataset for VQA on Document Images, 2020, 2021 IEEE Winter Conference on Applications of Computer Vision (WACV).

[37] Seunghyun Park, et al. Spatial Dependency Parsing for Semi-Structured Document Information Extraction, 2020, Findings of ACL.

[38] M. Turski, et al. DUE: End-to-End Document Understanding Benchmark, 2021, NeurIPS Datasets and Benchmarks.

[39] Stephen Lin, et al. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows, 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[40] Wei Han, et al. Finding the Evidence: Localization-aware Answer Prediction for Text Visual Question Answering, 2020, COLING.

[41] Yusheng Xie, et al. Towards Good Practices in Self-supervised Representation Learning, 2020, ArXiv.

[42] A. Schwing, et al. Spatially Aware Multimodal Transformers for TextVQA, 2020, ECCV.

[43] R. Manmatha, et al. SCATTER: Selective Context Attentional Scene Text Recognizer, 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[44] Jianfeng Gao, et al. UniLMv2: Pseudo-Masked Language Models for Unified Language Model Pre-Training, 2020, ICML.

[45] Furu Wei, et al. LayoutLM: Pre-training of Text and Layout for Document Image Understanding, 2019, KDD.

[46] Trevor Darrell, et al. Iterative Answer Prediction With Pointer-Augmented Multimodal Transformers for TextVQA, 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[47] Colin Raffel, et al. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer, 2019, J. Mach. Learn. Res.

[48] Wenhu Chen, et al. TabFact: A Large-scale Dataset for Table-based Fact Verification, 2019, ICLR.

[49] Teakgyu Hong, et al. BROS: A Pre-trained Language Model Focusing on Text and Layout for Better Key Information Extraction from Documents, 2020.

[50] Natalia Gimelshein, et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library, 2019, NeurIPS.

[51] Rémi Louf, et al. HuggingFace's Transformers: State-of-the-art Natural Language Processing, 2019, ArXiv.

[52] Seunghyun Park, et al. CORD: A Consolidated Receipt Dataset for Post-OCR Parsing, 2019.

[53] Shashank Shekhar, et al. OCR-VQA: Visual Question Answering by Reading Text in Images, 2019, 2019 International Conference on Document Analysis and Recognition (ICDAR).

[54] Omer Levy, et al. RoBERTa: A Robustly Optimized BERT Pretraining Approach, 2019, ArXiv.

[55] Ernest Valveny, et al. ICDAR 2019 Competition on Scene Text Visual Question Answering, 2019, 2019 International Conference on Document Analysis and Recognition (ICDAR).

[56] Ernest Valveny, et al. Scene Text Visual Question Answering, 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[57] Jean-Philippe Thiran, et al. FUNSD: A Dataset for Form Understanding in Noisy Scanned Documents, 2019, 2019 International Conference on Document Analysis and Recognition Workshops (ICDARW).

[58] Xinlei Chen, et al. Towards VQA Models That Can Read, 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[59] Ming-Wei Chang, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019, NAACL.

[60] Taku Kudo, et al. SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing, 2018, EMNLP.

[61] Jian Sun, et al. Deep Residual Learning for Image Recognition, 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[62] Kaiming He, et al. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks, 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[63] Thomas Brox, et al. U-Net: Convolutional Networks for Biomedical Image Segmentation, 2015, MICCAI.

[64] Konstantinos G. Derpanis, et al. Evaluation of deep convolutional nets for document image classification and retrieval, 2015, 2015 13th International Conference on Document Analysis and Recognition (ICDAR).