Creating a Scholarly Knowledge Graph from Survey Article Tables

Because it lacks structure, scholarly knowledge remains largely inaccessible to machines. Scholarly knowledge graphs have been proposed as a solution. Creating such a knowledge graph requires manual effort and the involvement of domain experts, and is therefore time-consuming and cumbersome. In this work, we present a human-in-the-loop methodology for building a scholarly knowledge graph from literature survey articles. Survey articles often contain manually curated, high-quality tabular information that summarizes findings published in the scientific literature. Consequently, survey articles are an excellent resource for generating a scholarly knowledge graph. The presented methodology consists of five steps, in which tables and references are extracted from PDF articles, and the tables are then formatted and finally ingested into the knowledge graph. To evaluate the methodology, 92 survey articles containing 160 survey tables were imported into the graph. In total, 2,626 papers were added to the knowledge graph using the presented methodology. The results demonstrate the feasibility of our approach, but also indicate that manual effort is required, which underscores the important role of human experts.
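
As an illustration of the ingestion idea only, the following minimal sketch turns one extracted and formatted survey table into paper-level statements. It is not the implementation described in this work: the function name `table_to_statements`, the assumption that the first column of each row carries a citation marker such as "[12]", and the plain (paper, property, value) tuples are all hypothetical choices made for the example. Rows whose citation cannot be resolved are returned separately, mirroring the human-in-the-loop step in which an expert corrects them.

```python
import re


def table_to_statements(header, rows, references):
    """Turn one survey table into (paper, property, value) statements.

    Assumption (hypothetical): the first column of each row holds a
    citation marker such as "[12]" that can be resolved against the
    reference list extracted from the same article.
    """
    statements = []
    unresolved = []
    for row in rows:
        match = re.search(r"\[(\d+)\]", row[0])
        paper = references.get(int(match.group(1))) if match else None
        if paper is None:
            # No resolvable citation: leave the row for a human curator.
            unresolved.append(row)
            continue
        # Remaining columns become properties of the cited paper.
        for prop, value in zip(header[1:], row[1:]):
            if value.strip():
                statements.append((paper, prop, value.strip()))
    return statements, unresolved
```

Each returned statement could then be written to a knowledge graph through whatever interface the graph offers, while the unresolved rows are queued for manual review.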
