Extracting Semantic Relations for Scholarly Knowledge Base Construction

Information extraction from scientific articles, which are stored as PDF documents in large digital repositories, is gaining attention as the volume of research findings continues to grow. We propose a system that extracts semantic relations among entities in scholarly articles using external syntactic patterns and an iterative learner. Information extraction from scholarly documents has been studied before, but prior work has focused mainly on abstracts and keywords. Our method extracts semantic entities as concepts and instances, along with their attributes, from the full body text of documents. We extract two types of relationships between concepts in the text using an iterative learning algorithm. External data sources from the web, such as the Microsoft Concept Graph and query logs, are used to evaluate the quality of the extracted concepts and relations. The concepts are then used to construct a scientific taxonomy covering the research content of the documents. We apply our approach to a set of 10,000 scholarly documents and conduct several evaluations to show the effectiveness of the proposed methods. Our system obtains a 23% improvement in precision over existing web IE tools when they are applied to scholarly documents.
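The abstract does not list the syntactic patterns the system uses. As an illustration only, the sketch below applies a single Hearst-style "such as" pattern to pull (instance, concept) pairs from one sentence. The regular expression, the function name `extract_isa`, and the single-word concept restriction are assumptions made for this example, not the authors' implementation, and the iterative learner is not shown.

```python
import re

# Hypothetical pattern: "<concept> such as <item>, <item> and <item>".
# The concept is limited to one word to keep the sketch simple.
SUCH_AS = re.compile(r"\b(?P<concept>\w+) such as (?P<items>[^.;:]+)")

# Separators between listed instances: commas (optionally followed by
# "and"/"or") or a bare "and"/"or".
ITEM_SEP = re.compile(r"\s*,\s*(?:and\s+|or\s+)?|\s+(?:and|or)\s+")


def extract_isa(sentence):
    """Return (instance, concept) pairs found by the 'such as' pattern."""
    pairs = []
    for match in SUCH_AS.finditer(sentence):
        concept = match.group("concept").lower()
        for item in ITEM_SEP.split(match.group("items")):
            item = item.strip()
            if item:
                pairs.append((item, concept))
    return pairs


if __name__ == "__main__":
    text = "We compare classifiers such as SVM, naive Bayes and decision trees."
    for instance, concept in extract_isa(text):
        print(f"{instance} is-a {concept}")
```

A realistic system would need noun-phrase chunking rather than a one-word concept match, and would score candidate pairs before adding them to a taxonomy.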
