DocILE 2023 Teaser: Document Information Localization and Extraction

The lack of publicly available data for information extraction (IE) from semi-structured business documents is a long-standing problem for the IE community. Publications relying on large-scale datasets use only proprietary, unpublished data due to the sensitive nature of such documents, while publicly available datasets are mostly small and domain-specific. The absence of a large-scale public dataset or benchmark hinders the reproducibility and cross-evaluation of published methods. The DocILE 2023 competition, hosted as a lab at the CLEF 2023 conference and as an ICDAR 2023 competition, will run the first major benchmark for the tasks of Key Information Localization and Extraction (KILE) and Line Item Recognition (LIR) from business documents. With thousands of annotated real documents from open sources, one hundred thousand synthetically generated documents, and nearly a million unlabeled documents, the DocILE lab provides the largest publicly available dataset for KILE and LIR to date. We look forward to contributions from the Computer Vision, Natural Language Processing, Information Retrieval, and other communities. The data, baselines, code and up-to-date information about the lab and competition are available at https://docile.rossum.ai/.
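To make the KILE task concrete, the sketch below illustrates one plausible way a predicted field could be compared against an annotated field: the field type must agree, the predicted box must overlap the annotated box, and the extracted text must match. This is a minimal illustration only; the `Field` structure, the IoU threshold, and the matching rule are assumptions for exposition and are not the official DocILE data model or evaluation protocol (see https://docile.rossum.ai/ for the actual benchmark definition).

```python
# Illustrative sketch of a KILE-style field and a simple localization +
# extraction match. All names and thresholds here are hypothetical.
from dataclasses import dataclass


@dataclass
class Field:
    fieldtype: str                            # e.g. "invoice_id", "total_amount"
    bbox: tuple[float, float, float, float]   # (left, top, right, bottom) in page coordinates
    text: str                                 # extracted value


def iou(a: Field, b: Field) -> float:
    """Intersection-over-union of two axis-aligned bounding boxes."""
    al, at, ar, ab = a.bbox
    bl, bt, br, bb = b.bbox
    iw = max(0.0, min(ar, br) - max(al, bl))
    ih = max(0.0, min(ab, bb) - max(at, bt))
    inter = iw * ih
    union = (ar - al) * (ab - at) + (br - bl) * (bb - bt) - inter
    return inter / union if union > 0 else 0.0


def is_match(pred: Field, gold: Field, iou_threshold: float = 0.5) -> bool:
    """A prediction counts as correct if the field type agrees, the boxes
    overlap sufficiently (localization), and the transcribed text matches
    (extraction). The 0.5 threshold is an assumption, not the DocILE metric."""
    return (
        pred.fieldtype == gold.fieldtype
        and iou(pred, gold) >= iou_threshold
        and pred.text.strip() == gold.text.strip()
    )


if __name__ == "__main__":
    gold = Field("invoice_id", (0.10, 0.05, 0.30, 0.08), "INV-2023-001")
    pred = Field("invoice_id", (0.11, 0.05, 0.29, 0.08), "INV-2023-001")
    print(is_match(pred, gold))  # True
```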
