Tab2Know: Building a Knowledge Base from Tables in Scientific Papers

Tables in scientific papers contain a wealth of valuable knowledge for the scientific enterprise. To help the many of us who frequently consult this type of knowledge, we present Tab2Know, a new end-to-end system that builds a Knowledge Base (KB) from tables in scientific papers. Tab2Know addresses the challenges of automatically interpreting the tables in papers and of disambiguating the entities that they contain. To solve these problems, we propose a pipeline that combines statistical classifiers with logic-based reasoning. First, the pipeline applies weakly supervised classifiers to recognize the types of tables and columns, with the help of a data labeling system and an ontology designed specifically for this purpose. Then, logic-based reasoning links equivalent entities in different tables via sameAs links. An empirical evaluation on a corpus of papers in the Computer Science domain yielded satisfactory performance, which suggests that our approach is a promising step toward a large-scale KB of scientific knowledge.
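
To make the two stages of the pipeline concrete, the following is a minimal sketch and not the Tab2Know implementation. It assumes hypothetical labeling functions over column headers, uses a plain majority vote as a stand-in for whatever label model the system actually trains, and assumes that sameAs links arrive as entity pairs whose transitive closure can be computed with union-find. The type names, header keywords, and entity identifiers are all invented for illustration.

```python
from collections import Counter, defaultdict

ABSTAIN = None  # a labeling function returns this when it has no opinion


# Hypothetical labeling functions: each votes on a column type from its header.
def lf_metric(header):
    keywords = ("accuracy", "f1", "precision", "recall")
    return "Metric" if any(k in header.lower() for k in keywords) else ABSTAIN


def lf_dataset(header):
    h = header.lower()
    return "Dataset" if "dataset" in h or "corpus" in h else ABSTAIN


def lf_method(header):
    return "Method" if header.lower() in ("method", "model", "system") else ABSTAIN


LABELING_FUNCTIONS = [lf_metric, lf_dataset, lf_method]


def weak_label(header):
    """Combine labeling-function votes by majority (an assumed simplification)."""
    votes = [lf(header) for lf in LABELING_FUNCTIONS]
    votes = [v for v in votes if v is not ABSTAIN]
    if not votes:
        return ABSTAIN
    return Counter(votes).most_common(1)[0][0]


def same_as_clusters(pairs):
    """Group entities connected by sameAs pairs (symmetric, transitive closure)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = defaultdict(set)
    for entity in list(parent):
        clusters[find(entity)].add(entity)
    return sorted(sorted(c) for c in clusters.values())


if __name__ == "__main__":
    for header in ("F1 score", "Model", "Year"):
        print(header, "->", weak_label(header))
    links = [("t1:BERT", "t2:bert"), ("t2:bert", "t3:BERT-base"),
             ("t1:SQuAD", "t4:SQuAD 1.1")]
    print(same_as_clusters(links))
```

In this sketch the weak labels would serve as noisy training data for a column-type classifier, and each resulting cluster would correspond to one KB entity mentioned in several tables.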
