DocParser: End-to-end OCR-free Information Extraction from Visually Rich Documents

Information extraction from visually rich documents is a challenging task that has gained considerable attention in recent years, owing to its importance in several document-control applications and its widespread commercial value. The majority of the research conducted on this topic to date follows a two-step pipeline: first, the text is read using an off-the-shelf Optical Character Recognition (OCR) engine; then, the fields of interest are extracted from the obtained text. The main drawback of these approaches is their dependence on an external OCR system, which can negatively impact both extraction performance and computational speed. Recent OCR-free methods have been proposed to address these issues. Inspired by their promising results, we propose in this paper an OCR-free end-to-end information extraction model named DocParser. It differs from prior end-to-end approaches in its ability to better extract discriminative character features. DocParser achieves state-of-the-art results on various datasets while remaining faster than previous works.
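The two-step pipeline described above can be sketched in a few lines. This is a minimal illustration, not the method of any cited system: `run_ocr` is a hypothetical stub standing in for an off-the-shelf OCR engine, and `extract_fields` plays the role of the downstream field-extraction step.

```python
import re

def run_ocr(image_bytes: bytes) -> str:
    # Hypothetical stub for an external OCR engine (step 1).
    # A real pipeline would call an off-the-shelf system here;
    # we return fixed text as if read from a scanned invoice.
    return "Invoice No: 4711 Total: 42.00 EUR"

def extract_fields(text: str) -> dict:
    # Step 2: extract the fields of interest from the OCR output.
    # Any OCR misread (e.g. '4711' recognized as '47l1') propagates
    # to this step, which is the drawback motivating OCR-free models.
    fields = {}
    m = re.search(r"Invoice No:\s*(\d+)", text)
    if m:
        fields["invoice_number"] = m.group(1)
    m = re.search(r"Total:\s*([\d.]+)\s*(\w+)", text)
    if m:
        fields["total"], fields["currency"] = m.group(1), m.group(2)
    return fields

print(extract_fields(run_ocr(b"")))
```

An OCR-free end-to-end model such as DocParser replaces both steps with a single model mapping the document image directly to the extracted fields, removing the dependence on the external OCR system.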
