Table extraction and understanding for scientific and enterprise applications

Valuable high-precision data are often published as tables in both scientific and business documents. While humans can easily identify, interpret, and contextualize tables, developing general-purpose automated techniques for extracting information from them is difficult because table formats vary widely across corpora. To yield useful data, data cells must be correctly extracted and linked to all relevant headers, units of measure, and in-text references. Table extraction involves identifying the border and cell structure of each table in a document, while table understanding provides context by linking cells with semantic information inside and outside the table, such as row and column headers, footnotes, titles, and references in the surrounding text. The objective of this tutorial is to provide a detailed synopsis of existing approaches to table extraction and understanding, highlight open research problems, and give an overview of potential applications.
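
To make the distinction concrete, the following minimal Python sketch illustrates only the linking step of table understanding, assuming table extraction has already produced a clean rectangular grid. It is not a method from the tutorial: the names `LinkedCell`, `split_unit`, and `link_cells` are hypothetical, the layout (one header row, one header column, units written in parentheses in the column header) is an assumption, and the sample values are illustrative.

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Assumption: units appear in trailing parentheses, e.g. "Density (g/cm3)".
UNIT_PATTERN = re.compile(r"^(.*?)\s*\(([^()]+)\)\s*$")


@dataclass
class LinkedCell:
    """A data cell together with the context needed to interpret it."""
    value: str
    row_header: str
    column_header: str
    unit: Optional[str]


def split_unit(header: str) -> Tuple[str, Optional[str]]:
    """Split a header such as 'Density (g/cm3)' into a name and a unit."""
    match = UNIT_PATTERN.match(header)
    if match:
        return match.group(1).strip(), match.group(2).strip()
    return header.strip(), None


def link_cells(grid: List[List[str]]) -> List[LinkedCell]:
    """Link every data cell to its row header, column header, and unit.

    Assumes the first row holds column headers and the first column
    holds row headers; real tables are often far less regular.
    """
    column_headers = [split_unit(h) for h in grid[0][1:]]
    linked = []
    for row in grid[1:]:
        row_header = row[0].strip()
        for (name, unit), value in zip(column_headers, row[1:]):
            linked.append(LinkedCell(value.strip(), row_header, name, unit))
    return linked


# Illustrative grid, as it might come out of a table extraction step.
grid = [
    ["Material", "Density (g/cm3)", "Melting point (K)"],
    ["Copper", "8.96", "1358"],
    ["Iron", "7.87", "1811"],
]
cells = link_cells(grid)
for cell in cells:
    print(cell)
```

Each resulting record carries its value together with its headers and unit, so it can be interpreted outside the table. Nested headers, spanning cells, footnotes, and references in the surrounding text are not handled by this sketch.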
