Hyponym/Hypernym Detection in Science and Technology Thesauri from Bibliographic Datasets

Thesauri for science and technology information are increasingly used in bibliometrics and scientometrics. However, manually constructing and maintaining thesauri is costly and time-consuming, so methods for semi-automatic construction and maintenance are being actively studied. We propose a method that expands an existing thesaurus with terms extracted from article abstracts. Specifically, we assign the extracted terms to subcategories obtained by clustering a word vector space, and then determine hyponym and hypernym relations based on how each term relates to the terms already in those subcategories. The word vectors are constructed from 177,000 IEEE articles published from 2012 to 2014 and archived in the Scopus dataset. In experiments, terms were correctly classified into the Japan Science and Technology (JST) thesaurus with 70.8% precision and 75.4% recall. In future work, we will develop a semi-automatic thesaurus maintenance system that recommends new terms in their proper positions relative to existing ones.
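
As a rough illustration of the pipeline described above, the sketch below trains word embeddings on tokenized abstracts, clusters the vocabulary into subcategories, and assigns a candidate term to the subcategory whose centroid is most similar to its vector. This is a minimal sketch under several assumptions: the paper clusters with X-means (which also estimates the number of clusters), while plain k-means with a fixed k stands in here; the corpus, hyperparameters, and use of gensim and scikit-learn are our illustrative choices, not the authors' implementation, and the subsequent hyponym/hypernym decision step is not reproduced.

```python
# Minimal sketch of the term-to-subcategory assignment step,
# assuming gensim and scikit-learn are available. k-means with a
# fixed k substitutes for the X-means clustering used in the paper.
import numpy as np
from gensim.models import Word2Vec
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy corpus: one tokenized abstract per inner list.
# The actual experiments use ~177,000 IEEE article abstracts.
abstracts = [
    ["semantic", "relation", "extraction", "from", "scientific", "text"],
    ["word", "embedding", "models", "for", "thesaurus", "expansion"],
    ["clustering", "word", "vectors", "into", "topic", "subcategories"],
]

# Train word vectors on the abstracts (hyperparameters are illustrative).
model = Word2Vec(abstracts, vector_size=100, window=5, min_count=1)

# Cluster all vocabulary vectors into subcategories.
vocab = list(model.wv.index_to_key)
vectors = np.array([model.wv[w] for w in vocab])
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(vectors)

def assign_subcategory(term: str) -> int:
    """Return the index of the subcategory whose centroid is closest
    to the term's vector by cosine similarity."""
    v = model.wv[term].reshape(1, -1)
    sims = cosine_similarity(v, kmeans.cluster_centers_)
    return int(np.argmax(sims))

print(assign_subcategory("embedding"))
```

In the paper's setting, the existing thesaurus terms falling in the same subcategory would then be examined to decide whether the new term is a hyponym or hypernym of them; that relation-analysis step depends on details not given in the abstract.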
