Automatic Knowledge Base Construction from Scholarly Documents

The continuing growth of published scholarly content on the web ensures the availability of the most recent scientific findings to researchers. Scholarly documents, such as research articles, are easily accessed by using academic search engines that are built on large repositories of scholarly documents. Scientific information extraction from documents into a structured knowledge graph representation facilitates automated machine understanding of a document's content. Traditional information extraction approaches, that either require training samples or a preexisting knowledge base to assist in the extraction, can be challenging when applied to large repositories of digital documents. Labeled training examples for such large scale are difficult to obtain for such datasets. Also, most available knowledge bases are built from web data and do not have sufficient coverage to include concepts found in scientific articles. In this paper we aim to construct a knowledge graph from scholarly documents while addressing both these issues. We propose a fully automatic, unsupervised system for scientific information extraction that does not build on an existing knowledge base and avoids manually-tagged training data. We describe and evaluate a constructed taxonomy that contains over 15k entities resulting from applying our approach to 10k documents.

[1]  Haixun Wang,et al.  Probase: a probabilistic taxonomy for text understanding , 2012, SIGMOD Conference.

[2]  Xiang Li,et al.  Commonsense Knowledge Base Completion , 2016, ACL.

[3]  Jens Lehmann,et al.  DBpedia: A Nucleus for a Web of Open Data , 2007, ISWC/ASWC.

[4]  Evgeniy Gabrilovich,et al.  A Review of Relational Machine Learning for Knowledge Graphs , 2015, Proceedings of the IEEE.

[5]  Doug Downey,et al.  Web-scale information extraction in knowitall: (preliminary results) , 2004, WWW '04.

[6]  Marti A. Hearst Automatic Acquisition of Hyponyms from Large Text Corpora , 1992, COLING.

[7]  Steffen Staab,et al.  Learning Taxonomic Relations from Heterogeneous Sources of Evidence , 2005 .

[8]  George A. Miller,et al.  WordNet: A Lexical Database for English , 1995, HLT.

[9]  James Ze Wang,et al.  Automated analysis of images in documents for intelligent document search , 2009, International Journal on Document Analysis and Recognition (IJDAR).

[10]  Estevam R. Hruschka,et al.  Toward an Architecture for Never-Ending Language Learning , 2010, AAAI.

[11]  Simone Paolo Ponzetto,et al.  Deriving a Large-Scale Taxonomy from Wikipedia , 2007, AAAI.

[12]  C. Lee Giles,et al.  Automatic Extraction of Data from Bar Charts , 2015, K-CAP.

[13]  Cornelia Caragea,et al.  PDFMEF: A Multi-Entity Knowledge Extraction Framework for Scholarly Documents and Semantic Search , 2015, K-CAP.

[14]  Prasenjit Mitra,et al.  AlgorithmSeer: A System for Extracting and Searching for Algorithms in Scholarly Big Data , 2016, IEEE Transactions on Big Data.

[15]  C. Lee Giles,et al.  A Machine Learning Approach for Semantic Structuring of Scientific Charts in Scholarly Documents , 2017, AAAI.

[16]  Daniel Jurafsky,et al.  Learning Syntactic Patterns for Automatic Hypernym Discovery , 2004, NIPS.

[17]  Gerhard Weikum,et al.  WWW 2007 / Track: Semantic Web Session: Ontologies ABSTRACT YAGO: A Core of Semantic Knowledge , 2022 .