Towards End-to-End Transformation of Arbitrary Tables from Untagged Portable Documents (PDF) to Linked Data

This paper addresses the end-to-end transformation of tables from untagged portable documents (PDF) to linked data. It covers table extraction from documents, reconstruction of the logical table structure, conceptualization of the tables' natural-language content, and linking of the extracted data with external vocabularies. We consider several promising approaches to deep-learning-based table detection, heuristic-based table structure recognition, rule-based table analysis, and knowledge-based table interpretation, which can serve as a basis for developing a consistent solution to this problem. Our application experience confirms that such solutions are in demand for populating databases and generating ontologies with tabular data extracted from weakly structured and semi-structured documents.
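
To make the last two stages named above more concrete, the following is a minimal sketch, not the authors' implementation. It assumes that table detection and structure recognition have already produced a simple grid of cell strings, with column headers in the first row and row labels in the first column. The names `analyze`, `interpret`, `VOCAB`, and the `http://example.org/` IRIs are hypothetical and introduced only for illustration.

```python
from typing import Dict, List, Tuple

# Assumed output of table detection and structure recognition:
# a grid of cell strings, first row = column headers, first column = row labels.
Grid = List[List[str]]
Triple = Tuple[str, str, str]

# Hypothetical vocabulary that maps normalized header text to property IRIs.
VOCAB: Dict[str, str] = {
    "population": "http://example.org/vocab/population",
    "area": "http://example.org/vocab/area",
}


def analyze(grid: Grid) -> List[Dict[str, str]]:
    """Table analysis: flatten the grid into (entity, header, value) records."""
    headers = grid[0][1:]
    records = []
    for row in grid[1:]:
        entity, values = row[0], row[1:]
        for header, value in zip(headers, values):
            records.append({"entity": entity, "header": header, "value": value})
    return records


def interpret(
    records: List[Dict[str, str]],
    vocab: Dict[str, str],
    base: str = "http://example.org/resource/",
) -> List[Triple]:
    """Table interpretation: link headers to vocabulary terms and emit triples."""
    triples = []
    for rec in records:
        prop = vocab.get(rec["header"].strip().lower())
        if prop is None:
            continue  # headers with no vocabulary match are skipped in this sketch
        subject = base + rec["entity"].strip().replace(" ", "_")
        triples.append((subject, prop, rec["value"].strip()))
    return triples


# Illustrative data only; not taken from the paper.
grid = [
    ["Region", "Population", "Area", "Notes"],
    ["North Bay", "1200", "35.5", "estimate"],
    ["South Bay", "800", "20.1", ""],
]

triples = interpret(analyze(grid), VOCAB)
for s, p, o in triples:
    print(f'<{s}> <{p}> "{o}" .')
```

A dictionary lookup stands in here for knowledge-based interpretation; a real system would match headers and cell values against external vocabularies rather than a fixed mapping.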
