HESDK: A Hybrid Approach to Extracting Scientific Domain Knowledge Entities

We investigate a variant of the problem of automatic keyphrase extraction from scientific documents, which we define as Scientific Domain Knowledge Entity (SDKE) extraction. Keyphrases are noun phrases that are important to the documents in which they appear. In contrast, an SDKE is a span of text that refers to a concept and can be classified as a process, material, task, dataset, etc. An SDKE represents domain knowledge but is not necessarily important to the document it appears in. Supervised keyphrase extraction algorithms using non-sequential classifiers and global measures of informativeness (PMI, tf-idf) have been used for this task. Another approach is to use sequential labeling algorithms with local context from a sentence, as is done in named entity recognition. We show that these two methods complement each other and that a simple merging can improve the extraction accuracy by 5-7 percentage points. We further propose several heuristics to improve the extraction accuracy. Our preliminary experiments suggest that it is possible to improve the accuracy of the sequential learner itself by utilizing the predictions of the non-sequential model.
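
To make the idea of merging the two extractors concrete, the following is a minimal sketch under stated assumptions, not the procedure used in this work. It assumes each model emits token-level spans of the form (start, end, label), and it assumes one plausible merging rule: keep every span from the sequential labeler and add a span from the non-sequential classifier only when it overlaps none of the spans already kept. The names Span, overlaps, and merge_predictions, and the choice to give the sequential model priority, are hypothetical and chosen for illustration only.

```python
from typing import List, Tuple

# Hypothetical span format: (start_token, end_token_exclusive, label).
Span = Tuple[int, int, str]


def overlaps(a: Span, b: Span) -> bool:
    """Return True if the two half-open token ranges share at least one token."""
    return a[0] < b[1] and b[0] < a[1]


def merge_predictions(sequential: List[Span], non_sequential: List[Span]) -> List[Span]:
    """Combine the outputs of two extractors into one list of spans.

    Assumed rule (not taken from the paper): every sequential span is kept,
    and a non-sequential span is added only if it overlaps no kept span.
    """
    merged = list(sequential)
    for candidate in non_sequential:
        if not any(overlaps(candidate, kept) for kept in merged):
            merged.append(candidate)
    # Sort by position so the result reads in document order.
    return sorted(merged)


if __name__ == "__main__":
    seq_spans = [(0, 2, "Task"), (5, 7, "Material")]
    nonseq_spans = [(1, 3, "Process"), (9, 11, "Process")]
    print(merge_predictions(seq_spans, nonseq_spans))
```

In this sketch the span (1, 3, "Process") is discarded because it overlaps a sequential prediction, while (9, 11, "Process") is added, so the merged output can only gain spans relative to the sequential model alone. Other rules, such as preferring the longer span or the higher-confidence prediction, would fit the same interface.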