DocFormerv2: Local Features for Document Understanding

We propose DocFormerv2, a multi-modal transformer for Visual Document Understanding (VDU). The VDU domain entails understanding documents beyond mere OCR predictions, e.g., extracting information from a form, VQA for documents, and other tasks. VDU is challenging because a model must make sense of multiple modalities (visual, language, and spatial) to make a prediction. Our approach, termed DocFormerv2, is an encoder-decoder transformer that takes vision, language, and spatial features as input. DocFormerv2 is pre-trained with unsupervised tasks employed asymmetrically, i.e., two novel document tasks on the encoder and one on the auto-regressive decoder. The unsupervised tasks are carefully designed so that pre-training encourages local-feature alignment between the modalities. Evaluated on nine datasets, DocFormerv2 achieves state-of-the-art performance over strong baselines, e.g., TabFact (4.3%), InfoVQA (1.4%), and FUNSD (1%). Furthermore, to demonstrate generalization, on three VQA tasks involving scene text, DocFormerv2 outperforms previous comparably-sized models and even surpasses much larger models (such as GIT2, PaLI, and Flamingo) on some tasks. Extensive ablations show that, owing to its pre-training, DocFormerv2 understands multiple modalities better than prior art in VDU.
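To make the architecture description concrete, below is a minimal PyTorch sketch of the kind of multi-modal encoder-decoder the abstract outlines: OCR token embeddings, 2D bounding-box (spatial) embeddings, and visual features are fused and fed to a transformer encoder, while an auto-regressive decoder produces output text. Every name, dimension, and the additive fusion of layout embeddings here are illustrative assumptions, not DocFormerv2's actual implementation; the paper's novel pre-training tasks are likewise not shown.

```python
import torch
import torch.nn as nn

class MultiModalDocEncoderDecoder(nn.Module):
    """Toy encoder-decoder fusing language, spatial, and visual inputs.

    This is a sketch under assumed sizes and fusion choices, not the
    authors' model.
    """

    def __init__(self, vocab_size=32128, d_model=256, n_heads=4, n_layers=2):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        # One embedding table per bounding-box coordinate (x0, y0, x1, y1),
        # quantized to a 0..999 grid -- a common way to encode 2D layout.
        self.bbox_embs = nn.ModuleList(
            [nn.Embedding(1000, d_model) for _ in range(4)])
        # Stand-in visual branch: project pre-extracted patch features
        # (e.g., from a CNN or ViT backbone) into the shared model space.
        self.visual_proj = nn.Linear(768, d_model)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=n_heads,
            num_encoder_layers=n_layers, num_decoder_layers=n_layers,
            batch_first=True)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, ocr_ids, ocr_boxes, patch_feats, target_ids):
        # Fuse the language and spatial modalities by addition.
        x = self.token_emb(ocr_ids)
        for i, emb in enumerate(self.bbox_embs):
            x = x + emb(ocr_boxes[..., i])
        # Prepend projected visual tokens to the text+layout sequence.
        enc_in = torch.cat([self.visual_proj(patch_feats), x], dim=1)
        # Causal mask keeps the decoder auto-regressive.
        tgt = self.token_emb(target_ids)
        causal = self.transformer.generate_square_subsequent_mask(tgt.size(1))
        out = self.transformer(enc_in, tgt, tgt_mask=causal)
        return self.lm_head(out)  # (batch, target_len, vocab) logits

# Toy forward pass: 12 OCR tokens with quantized boxes, 16 visual patches.
model = MultiModalDocEncoderDecoder()
logits = model(
    ocr_ids=torch.randint(0, 32128, (1, 12)),
    ocr_boxes=torch.randint(0, 1000, (1, 12, 4)),
    patch_feats=torch.randn(1, 16, 768),
    target_ids=torch.randint(0, 32128, (1, 8)),
)
print(logits.shape)  # torch.Size([1, 8, 32128])
```

Additive fusion of quantized bounding-box embeddings follows common practice in layout-aware models such as LayoutLM [45]; where the real model differs (e.g., in how visual features are extracted or how the pre-training tasks supervise local-feature alignment), the paper itself is the authority.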

[1] Mu Li, et al. MixGen: A New Multi-Modal Data Augmentation, 2022, 2023 IEEE/CVF Winter Conference on Applications of Computer Vision Workshops (WACVW).

[2] Feiqi Cao, et al. SceneGATE: Scene-Graph based co-Attention networks for TExt visual question answering, 2022, ArXiv.

[3] Mohit Bansal, et al. Unifying Vision, Text, and Layout for Universal Document Processing, 2022, 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[4] Wei Wei, et al. A Benchmark for Structured Extractions from Complex Documents, 2022, ArXiv.

[5] N. Vasconcelos, et al. YORO - Lightweight End to End Visual Grounding, 2022, ECCV Workshops.

[6] Hua Wu, et al. ERNIE-Layout: Layout Knowledge Enhanced Pre-training for Visually-rich Document Understanding, 2022, EMNLP.

[7] Julian Martin Eisenschlos, et al. Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding, 2022, ArXiv.

[8] Furu Wei, et al. XDoc: Unified Pre-training for Cross-Format Document Understanding, 2022, EMNLP.

[9] Ashish V. Thapliyal, et al. PaLI: A Jointly-Scaled Multilingual Language-Image Model, 2022, ICLR.

[10] Radu Soricut, et al. PreSTU: Pre-Training for Scene-Text Understanding, 2022, ArXiv.

[11] Ramprasaath R. Selvaraju, et al. TAG: Boosting Text-VQA via Text-aware Visual Question-answer Generation, 2022, BMVC.

[12] Zhe Gan, et al. GIT: A Generative Image-to-text Transformer for Vision and Language, 2022, Trans. Mach. Learn. Res.

[13] Oriol Vinyals, et al. Flamingo: a Visual Language Model for Few-Shot Learning, 2022, NeurIPS.

[14] Vlad I. Morariu, et al. Unified Pretraining Framework for Document Understanding, 2022, ArXiv.

[15] Furu Wei, et al. LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking, 2022, ACM Multimedia.

[16] Vlad I. Morariu, et al. End-to-end Document Recognition and Understanding with Dessurt, 2022, ECCV Workshops.

[17] Yu Zhou, et al. Towards Escaping from Language Bias and OCR Error: Semantics-Centered Text Visual Question Answering, 2022, 2022 IEEE International Conference on Multimedia and Expo (ICME).

[18] Nan Hua, et al. FormNet: Structural Encoding beyond Sequential Modeling in Form Document Information Extraction, 2022, ACL.

[19] Liqing Zhang, et al. XYLayoutLM: Towards Layout-Aware Multimodal Networks For Visually-Rich Document Understanding, 2022, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[20] Lianwen Jin, et al. LiLT: A Simple yet Effective Language-Independent Layout Transformer for Structured Document Understanding, 2022, ACL.

[21] Ives Macêdo, et al. SeeTek: Very Large-Scale Open-set Logo Recognition with Text-Aware Metric Learning, 2022, 2022 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV).

[22] Srikar Appalaraju, et al. LaTr: Layout-Aware Transformer for Scene-Text VQA, 2021, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[23] Dongyoon Han, et al. OCR-Free Document Understanding Transformer, 2021, ECCV.

[24] Li Dong, et al. Swin Transformer V2: Scaling Up Capacity and Resolution, 2021, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[25] Furu Wei, et al. MarkupLM: Pre-training of Text and Markup Language for Visually Rich Document Understanding, 2021, ACL.

[26] Jean Oh, et al. Localize, Group, and Select: Boosting Text-VQA by Scene Text Modeling, 2021, 2021 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW).

[27] Errui Ding, et al. StrucTexT: Structured Text Understanding with Multi-Modal Transformers, 2021, ACM Multimedia.

[28] Bhargava Urala Kota, et al. DocFormer: End-to-End Transformer for Document Understanding, 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[29] Hongfu Liu, et al. SelfDoc: Self-Supervised Document Representation Learning, 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[30] Ilya Sutskever, et al. Learning Transferable Visual Models From Natural Language Supervision, 2021, ICML.

[31] Tomasz Dwojak, et al. Going Full-TILT Boogie on Document Understanding with Text-Image-Layout Transformer, 2021, ICDAR.

[32] Wonjae Kim, et al. ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision, 2021, ICML.

[33] Cha Zhang, et al. LayoutLMv2: Multi-modal Pre-training for Visually-rich Document Understanding, 2020, ACL.

[34] Jiebo Luo, et al. TAP: Text-Aware Pre-training for Text-VQA and Text-Caption, 2020, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[35] S. Gelly, et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, 2020, ICLR.

[36] C. V. Jawahar, et al. DocVQA: A Dataset for VQA on Document Images, 2020, 2021 IEEE Winter Conference on Applications of Computer Vision (WACV).

[37] Seunghyun Park, et al. Spatial Dependency Parsing for Semi-Structured Document Information Extraction, 2020, Findings of ACL.

[38] M. Turski, et al. DUE: End-to-End Document Understanding Benchmark, 2021, NeurIPS Datasets and Benchmarks.

[39] Stephen Lin, et al. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows, 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[40] Wei Han, et al. Finding the Evidence: Localization-aware Answer Prediction for Text Visual Question Answering, 2020, COLING.

[41] Yusheng Xie, et al. Towards Good Practices in Self-supervised Representation Learning, 2020, ArXiv.

[42] A. Schwing, et al. Spatially Aware Multimodal Transformers for TextVQA, 2020, ECCV.

[43] R. Manmatha, et al. SCATTER: Selective Context Attentional Scene Text Recognizer, 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[44] Jianfeng Gao, et al. UniLMv2: Pseudo-Masked Language Models for Unified Language Model Pre-Training, 2020, ICML.

[45] Furu Wei, et al. LayoutLM: Pre-training of Text and Layout for Document Image Understanding, 2019, KDD.

[46] Trevor Darrell, et al. Iterative Answer Prediction With Pointer-Augmented Multimodal Transformers for TextVQA, 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[47] Colin Raffel, et al. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer, 2019, J. Mach. Learn. Res.

[48] Wenhu Chen, et al. TabFact: A Large-scale Dataset for Table-based Fact Verification, 2019, ICLR.

[49] Teakgyu Hong, et al. BROS: A Pre-trained Language Model Focusing on Text and Layout for Better Key Information Extraction from Documents, 2020.

[50] Natalia Gimelshein, et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library, 2019, NeurIPS.

[51] Rémi Louf, et al. HuggingFace's Transformers: State-of-the-art Natural Language Processing, 2019, ArXiv.

[52] Seunghyun Park, et al. CORD: A Consolidated Receipt Dataset for Post-OCR Parsing, 2019.

[53] Shashank Shekhar, et al. OCR-VQA: Visual Question Answering by Reading Text in Images, 2019, 2019 International Conference on Document Analysis and Recognition (ICDAR).

[54] Omer Levy, et al. RoBERTa: A Robustly Optimized BERT Pretraining Approach, 2019, ArXiv.

[55] Ernest Valveny, et al. ICDAR 2019 Competition on Scene Text Visual Question Answering, 2019, 2019 International Conference on Document Analysis and Recognition (ICDAR).

[56] Ernest Valveny, et al. Scene Text Visual Question Answering, 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[57] Jean-Philippe Thiran, et al. FUNSD: A Dataset for Form Understanding in Noisy Scanned Documents, 2019, 2019 International Conference on Document Analysis and Recognition Workshops (ICDARW).

[58] Xinlei Chen, et al. Towards VQA Models That Can Read, 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[59] Ming-Wei Chang, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019, NAACL.

[60] Taku Kudo, et al. SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing, 2018, EMNLP.

[61] Jian Sun, et al. Deep Residual Learning for Image Recognition, 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[62] Kaiming He, et al. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks, 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[63] Thomas Brox, et al. U-Net: Convolutional Networks for Biomedical Image Segmentation, 2015, MICCAI.

[64] Konstantinos G. Derpanis, et al. Evaluation of deep convolutional nets for document image classification and retrieval, 2015, 2015 13th International Conference on Document Analysis and Recognition (ICDAR).