Classification of keyphrases from scientific publications using WordNet and word embeddings

The ScienceIE task at SemEval-2017 introduced an epistemological classification of keyphrases in scientific publications, suggesting that research activities revolve around three key concepts: process (methods and systems), material (data, physical resources and products) and task (problems and activities to pursue). In this paper we present a method for classifying keyphrases according to the ScienceIE scheme, using features derived from WordNet and word embeddings. The method outperforms the best system at SemEval-2017, although our experiments also highlighted some annotation issues with the collection.
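As a rough illustration of how embedding-derived features can drive this kind of three-way classification, the sketch below averages word vectors over a keyphrase and assigns the label of the most similar class centroid. It is only a minimal sketch under stated assumptions: the toy three-dimensional vectors, the training phrases, and the helper names (`phrase_vector`, `train_centroids`, `classify`) are hypothetical, the nearest-centroid rule is a simple stand-in rather than the classifier used in the paper, and the WordNet-derived features are omitted.

```python
import math

# Hypothetical toy "embeddings" (3 dimensions) for illustration only;
# real experiments would load pre-trained word vectors instead.
TOY_EMBEDDINGS = {
    "simulation": [0.9, 0.1, 0.0],
    "algorithm": [0.8, 0.2, 0.1],
    "steel": [0.1, 0.9, 0.0],
    "dataset": [0.2, 0.8, 0.1],
    "prediction": [0.1, 0.1, 0.9],
    "detection": [0.0, 0.2, 0.8],
}


def phrase_vector(phrase):
    """Average the vectors of the known words in a keyphrase (None if none are known)."""
    vectors = [TOY_EMBEDDINGS[w] for w in phrase.lower().split() if w in TOY_EMBEDDINGS]
    if not vectors:
        return None
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def train_centroids(labelled_phrases):
    """Build one mean vector per label from (phrase, label) pairs."""
    by_label = {}
    for phrase, label in labelled_phrases:
        vector = phrase_vector(phrase)
        if vector is not None:
            by_label.setdefault(label, []).append(vector)
    return {
        label: [sum(column) / len(vectors) for column in zip(*vectors)]
        for label, vectors in by_label.items()
    }


def classify(phrase, centroids):
    """Return the label whose centroid is most similar to the phrase vector."""
    vector = phrase_vector(phrase)
    if vector is None:
        return None  # no known word: the sketch abstains
    return max(centroids, key=lambda label: cosine(vector, centroids[label]))


# Hypothetical training examples, one per ScienceIE class.
TRAIN = [("simulation", "Process"), ("steel", "Material"), ("prediction", "Task")]
CENTROIDS = train_centroids(TRAIN)

if __name__ == "__main__":
    for keyphrase in ["algorithm", "steel dataset", "detection"]:
        print(keyphrase, "->", classify(keyphrase, CENTROIDS))
```

In a fuller setup the averaged vector would be one feature group among several, for example concatenated with WordNet-based indicators before being passed to a trained classifier.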
