Paper Information - HESDK: A Hybrid Approach to Extracting Scientific Domain Knowledge Entities

HESDK: A Hybrid Approach to Extracting Scientific Domain Knowledge Entities

We investigate a variant of the problem of automatic keyphrase extraction from scientific documents, which we define as Scientific Domain Knowledge Entity (SDKE) extraction. Keyphrases are noun phrases that are important to the documents in which they appear. In contrast, an SDKE is a span of text that refers to a concept and can be classified as a process, material, task, dataset, etc. An SDKE represents domain knowledge but is not necessarily important to the document it appears in. Supervised keyphrase extraction algorithms that use non-sequential classifiers and global measures of informativeness (PMI, tf-idf) have been applied to this task. Another approach is to use sequential labeling algorithms with local context from a sentence, as is done in named entity recognition. We show that these two methods complement each other and that simply merging their outputs can improve extraction accuracy by 5-7 percentage points. We further propose several heuristics to improve the extraction accuracy. Our preliminary experiments suggest that the accuracy of the sequential learner itself can be improved by utilizing the predictions of the non-sequential model.
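The abstract does not spell out how the two extractors' outputs are merged, so the following is only a minimal sketch under stated assumptions: predictions are represented as token spans with a label, every span from the sequential labeler is kept, and a span from the non-sequential classifier is added only when it overlaps none of the spans already kept. The `Span` type, the function names, and the overlap rule are all hypothetical and are not taken from the paper.

```python
from typing import List, Tuple

# Hypothetical representation: (start_token, end_token_exclusive, label).
Span = Tuple[int, int, str]


def overlaps(a: Span, b: Span) -> bool:
    """Return True if two token spans share at least one token."""
    return a[0] < b[1] and b[0] < a[1]


def merge_predictions(sequential: List[Span],
                      non_sequential: List[Span]) -> List[Span]:
    """Union of both prediction sets, preferring the sequential labeler.

    Assumption (not from the paper): when spans from the two models
    overlap, the sequential labeler's span wins and the other is dropped.
    """
    merged = list(sequential)
    for candidate in non_sequential:
        if not any(overlaps(candidate, kept) for kept in merged):
            merged.append(candidate)
    return sorted(merged)


# Toy example with made-up spans.
seq_spans = [(0, 2, "Task"), (5, 7, "Material")]
clf_spans = [(1, 3, "Task"), (9, 11, "Process")]
merged = merge_predictions(seq_spans, clf_spans)
print(merged)
```

In this toy input the classifier's span `(1, 3, "Task")` overlaps a span from the sequential labeler and is discarded, while `(9, 11, "Process")` is new and is added. Other conflict rules (for example, keeping the longer span or the higher-confidence one) would be equally plausible.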