Table Understanding in Structured Documents

Table detection and extraction have been studied mainly in documents such as reports, where tables are clearly outlined and stand out visually from the rest of the document. We study the problem in the more challenging domain of layout-heavy business documents, particularly invoices. Invoices present novel challenges: tables often lack outlines (either borders or surrounding text flow), columns are ragged, and the data content varies widely. We also show that a single model can extract specific information from structurally different tables and table-like structures. We present a comprehensive representation of a page using a graph over word boxes, positional embeddings, and trainable textual features, and we rephrase table detection as a text box labeling problem. Using this representation, we work with our newly presented dataset of pro forma invoices, invoices, and debit notes and propose multiple baselines for the labeling problem. We then propose a novel neural network model that achieves strong, practical results on this dataset, and we analyze in detail the model's performance and the effects of graph convolutions and self-attention.
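To make the page representation more concrete, the following is a minimal sketch of a graph over word boxes with simple positional features. It is an illustration under stated assumptions, not the construction used in the paper: the `WordBox` type, the nearest-neighbor-per-direction rule, and the normalized-coordinate features are all hypothetical choices made for this example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class WordBox:
    """Hypothetical word box: text plus bounding box in page coordinates."""
    text: str
    left: float
    top: float
    right: float
    bottom: float


def _overlaps(a_lo: float, a_hi: float, b_lo: float, b_hi: float) -> bool:
    """True if the two 1D intervals share a non-empty overlap."""
    return min(a_hi, b_hi) - max(a_lo, b_lo) > 0


def build_graph(boxes: List[WordBox]) -> List[Dict[str, Optional[int]]]:
    """For each box, find the index of the nearest box in each direction.

    Assumption for this sketch: a neighbor must overlap the box along the
    perpendicular axis (same row for left/right, same column for up/down).
    """
    graph = []
    for i, a in enumerate(boxes):
        best: Dict[str, Optional[int]] = {
            "left": None, "right": None, "up": None, "down": None
        }
        dist = {key: float("inf") for key in best}
        for j, b in enumerate(boxes):
            if i == j:
                continue
            if _overlaps(a.top, a.bottom, b.top, b.bottom):
                if b.right <= a.left:
                    key, d = "left", a.left - b.right
                elif b.left >= a.right:
                    key, d = "right", b.left - a.right
                else:
                    continue
            elif _overlaps(a.left, a.right, b.left, b.right):
                if b.bottom <= a.top:
                    key, d = "up", a.top - b.bottom
                elif b.top >= a.bottom:
                    key, d = "down", b.top - a.bottom
                else:
                    continue
            else:
                continue
            if d < dist[key]:
                dist[key], best[key] = d, j
        graph.append(best)
    return graph


def positional_features(
    box: WordBox, page_width: float, page_height: float
) -> Tuple[float, float, float, float]:
    """Box coordinates normalized to [0, 1] by the page size."""
    return (
        box.left / page_width,
        box.top / page_height,
        box.right / page_width,
        box.bottom / page_height,
    )


# A tiny borderless two-column "table": a header row and one data row.
boxes = [
    WordBox("Item", 0, 0, 40, 10),
    WordBox("Price", 100, 0, 140, 10),
    WordBox("Widget", 0, 20, 50, 30),
    WordBox("9.99", 100, 20, 130, 30),
]
graph = build_graph(boxes)
```

In a labeling formulation, each word box would then receive a label (for example, whether it belongs to a table), predicted from its own features and those of its graph neighbors. The neighbor links let a model relate a cell to its row and column even when the table has no drawn borders.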
