DiT: Self-supervised Pre-training for Document Image Transformer

Image Transformers have recently achieved significant progress on natural image understanding, using either supervised (ViT, DeiT, etc.) or self-supervised (BEiT, MAE, etc.) pre-training techniques. In this paper, we propose DiT, a self-supervised pre-trained Document Image Transformer model that uses large-scale unlabeled text images for Document AI tasks. This is essential since no supervised counterpart exists, owing to the lack of human-labeled document images. We leverage DiT as the backbone network in a variety of vision-based Document AI tasks, including document image classification, document layout analysis, table detection, and text detection for OCR. Experimental results show that the self-supervised pre-trained DiT model achieves new state-of-the-art results on these downstream tasks, e.g., document image classification (91.11 → 92.69), document layout analysis (91.0 → 94.9), table detection (94.23 → 96.55), and text detection for OCR (93.07 → 94.29). The code and pre-trained models are publicly available at https://aka.ms/msdit.
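As a point of reference, DiT checkpoints can be loaded through the Hugging Face transformers library. The following is a minimal sketch of using DiT as a backbone for document image classification; the checkpoint name microsoft/dit-base-finetuned-rvlcdip (a DiT-base model fine-tuned on RVL-CDIP) and the input file document.png are illustrative assumptions, not details taken from this abstract.

    # Minimal sketch: document image classification with a DiT backbone.
    # Assumes the Hugging Face checkpoint "microsoft/dit-base-finetuned-rvlcdip"
    # and a local scanned page "document.png" (both assumptions for illustration).
    from PIL import Image
    from transformers import AutoImageProcessor, AutoModelForImageClassification

    processor = AutoImageProcessor.from_pretrained("microsoft/dit-base-finetuned-rvlcdip")
    model = AutoModelForImageClassification.from_pretrained("microsoft/dit-base-finetuned-rvlcdip")

    image = Image.open("document.png").convert("RGB")
    inputs = processor(images=image, return_tensors="pt")  # -> pixel_values tensor
    logits = model(**inputs).logits                        # one score per document class
    print(model.config.id2label[logits.argmax(-1).item()])

The same pre-trained backbone can be plugged into detection frameworks (e.g., Mask R-CNN or Cascade R-CNN) for the layout analysis, table detection, and text detection tasks reported above.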
