Form-NLU: Dataset for the Form Natural Language Understanding

Compared to general document analysis tasks, form document structure understanding and retrieval are challenging. Form documents typically involve two types of authors: a form designer, who develops the form structure and keys, and a form user, who fills in values based on the provided keys. Hence, the filled-in values may not align with the designer's intended structure and keys if the user is confused by the layout. In this paper, we introduce Form-NLU, the first dataset for form structure understanding and key-value information extraction that captures the form designer's intent and the alignment of user-written values with it. It consists of 857 form images, 6k form keys and values, and 4k table keys and values, and covers three form types (digital, printed, and handwritten) with diverse appearances and layouts. We also propose a robust form key-value information extraction framework based on positional and logical relations. Using Form-NLU, we first examine strong object detection models for form layout understanding, then evaluate the key information extraction task, providing fine-grained results for different types of forms and keys. Furthermore, we pair the framework with an off-the-shelf PDF layout extraction tool and demonstrate its feasibility in real-world cases.
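
To make the key-value extraction task concrete, the sketch below shows one way form annotations could be represented and how detected key and value regions might be paired by position. The `Region` class, its field names, and the greedy nearest-key heuristic are illustrative assumptions for exposition only, not the dataset's actual schema or the paper's framework.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Region:
    """A detected text region on a form image (hypothetical annotation format)."""
    label: str                                 # e.g. "key" or "value"
    text: str                                  # OCR'd or transcribed content
    bbox: Tuple[float, float, float, float]    # (x0, y0, x1, y1) in image coordinates

def pair_keys_to_values(keys: List[Region], values: List[Region]) -> List[Tuple[Region, Region]]:
    """Greedy positional pairing: assign each value to the nearest key
    whose box starts above or to the left of it."""
    pairs = []
    for value in values:
        candidates = [k for k in keys
                      if k.bbox[0] <= value.bbox[0] or k.bbox[1] <= value.bbox[1]]
        if not candidates:
            continue
        nearest = min(
            candidates,
            key=lambda k: (k.bbox[0] - value.bbox[0]) ** 2 + (k.bbox[1] - value.bbox[1]) ** 2,
        )
        pairs.append((nearest, value))
    return pairs

# Toy usage: one key region and one value region filled in by a form user.
keys = [Region("key", "Company name", (10, 10, 120, 30))]
values = [Region("value", "ACME Pty Ltd", (130, 10, 300, 30))]
print(pair_keys_to_values(keys, values))
```

A real pipeline along the lines described in the abstract would also exploit logical relations between regions, and would distinguish form keys from table keys across digital, printed, and handwritten forms, rather than relying on position alone.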
