Table extraction, analysis, and interpretation: the current state of the TabbyDOC project

Freely available tabular data in various digital formats, such as print-oriented documents, spreadsheets, and web pages, are a valuable source for populating knowledge graphs. However, difficulties that inevitably arise in extracting and integrating such data often hinder their intensive use in practice. The TabbyDOC project aims to elaborate a theoretical basis and develop open software for data extraction from arbitrary tables. Previously, it addressed the following issues: (i) table extraction from print-oriented documents and (ii) data transformation from spreadsheet tables to relational and linked data. This paper summarizes the project’s results for the following tasks: (i) automated fine-tuning of artificial neural networks for table detection in document images, (ii) synthesis of programs for spreadsheet data transformation driven by user-defined rules of table analysis and interpretation, and (iii) generation of RDF triples from entities extracted from relational tables.
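To make the third task concrete, the following is a minimal sketch of how rows of a relational table could be serialized as RDF triples in N-Triples syntax. It is not the project's method: the namespace `http://example.org/`, the helper names `iri`, `literal`, and `row_to_triples`, and the sample table are all hypothetical. It maps every non-key column to a plain string literal and does not attempt to link entities to an existing knowledge graph.

```python
from urllib.parse import quote

# Hypothetical namespace used only for this illustration.
BASE = "http://example.org/"


def iri(local_name):
    """Build an IRI in the example namespace from a cell or column name."""
    cleaned = str(local_name).strip().replace(" ", "_")
    return "<" + BASE + quote(cleaned, safe="") + ">"


def literal(value):
    """Serialize a value as a plain N-Triples string literal."""
    escaped = (
        str(value)
        .replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return '"' + escaped + '"'


def row_to_triples(row, key_column):
    """Turn one table row (a dict) into a list of N-Triples lines.

    The value in key_column names the subject entity; every other
    non-empty column becomes a predicate with a literal object.
    """
    subject = iri(row[key_column])
    triples = []
    for column, value in row.items():
        if column == key_column or value in (None, ""):
            continue  # skip the key itself and empty cells
        triples.append(f"{subject} {iri(column)} {literal(value)} .")
    return triples


# Hypothetical relational table: one dict per row.
table = [
    {"country": "Norway", "capital": "Oslo", "population": "5400000"},
    {"country": "Chile", "capital": "Santiago", "population": ""},
]

for row in table:
    for line in row_to_triples(row, "country"):
        print(line)
```

In this sketch, the choice of key column decides which cell names the subject entity. A fuller pipeline would also need to type literals and map column names and cell values to the vocabulary of a target knowledge graph.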
