VisualWordGrid: Information Extraction From Scanned Documents Using A Multimodal Approach

We introduce a novel approach to scanned document representation for field extraction. It encodes textual, visual and layout information simultaneously in a 3D matrix that serves as the input to a segmentation model. We improve on the recent Chargrid and Wordgrid models in two ways: first by taking the visual modality into account, then by making the model more robust on small datasets while keeping inference time low. We test our approach on public and private document-image datasets, where it outperforms recent state-of-the-art methods.
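To make the idea of a 3D input matrix concrete, here is a minimal sketch of one way such a grid could be assembled. It is an illustration under stated assumptions, not the authors' implementation: the function names `toy_embedding` and `build_grid`, the 4-dimensional embedding, and the choice to append normalized RGB values as extra channels are all hypothetical. Each word's bounding box is filled with that word's embedding (text and layout), and the image pixels are concatenated along the channel axis (visual).

```python
import numpy as np

def toy_embedding(word, dim=4):
    """Hypothetical stand-in for a pretrained word embedding lookup."""
    seed = sum(ord(c) for c in word.lower())
    rng = np.random.default_rng(seed)
    return rng.standard_normal(dim).astype(np.float32)

def build_grid(image, words, dim=4):
    """Build an (H, W, dim + 3) grid from an RGB image and OCR words.

    `words` is a list of (token, (x0, y0, x1, y1)) pixel boxes.
    Assumption: the visual modality is added as normalized RGB channels.
    """
    h, w, _ = image.shape
    text = np.zeros((h, w, dim), dtype=np.float32)
    for token, (x0, y0, x1, y1) in words:
        # Every pixel inside the word box carries the word's embedding.
        text[y0:y1, x0:x1, :] = toy_embedding(token, dim)
    visual = image.astype(np.float32) / 255.0
    return np.concatenate([text, visual], axis=-1)

# A blank white 32x64 "page" with two OCR words.
image = np.full((32, 64, 3), 255, dtype=np.uint8)
words = [("Total", (2, 4, 20, 12)), ("42.00", (30, 4, 50, 12))]
grid = build_grid(image, words)
print(grid.shape)
```

A segmentation network would then take `grid` as input and predict a field class for every pixel; background pixels carry a zero text vector, so only the visual channels describe them.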