Expanding Science and Technology Thesauri from Bibliographic Datasets Using Word Embedding

The use of thesauri and taxonomies for science and technology information in scientometrics has been attracting attention. However, manual construction and maintenance of thesauri are expensive and time-consuming; thus, methods for semi-automatic construction and maintenance are being actively studied. We propose a method to expand an existing thesaurus using the abstracts of articles from state-of-the-art technological domains with limited structured information. Specifically, we consider a method for appropriately allocating new terms within the hierarchical structure of an existing thesaurus using word embeddings, a rapidly evolving technique. In an experiment, 500-dimensional word vectors are constructed from 567,000 biomedical articles and are clustered after dimension reduction using principal component analysis. Then, semantic relations are estimated based on the spatial relations between a new term and the terms already in the thesaurus. We compared the estimated relations with evaluations by three domain experts. In future work, we will develop a system that recommends new terms related to existing terms, to support semi-automatic thesaurus maintenance.
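The pipeline described above (embed terms, reduce dimensionality with PCA, cluster, then relate a new term to nearby thesaurus terms) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the random vectors stand in for 500-dimensional word2vec embeddings, the term names are hypothetical, and plain k-means is used as a placeholder where the paper's setup would use X-means to choose the number of clusters automatically.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(0)

# Hypothetical vocabulary: existing thesaurus terms plus one new term.
# Random vectors stand in for 500-dimensional word2vec embeddings
# trained on article abstracts.
vocab = ["geneA", "geneB", "proteinX", "proteinY", "cellZ", "new_term"]
vectors = rng.normal(size=(len(vocab), 500))

# Dimension reduction with PCA before clustering (n_components is capped
# by the tiny sample size here; a real run would keep more components).
reduced = PCA(n_components=3).fit_transform(vectors)

# Cluster terms in the reduced space. The paper's method would use
# X-means; fixed-k KMeans is a stand-in for this sketch.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(reduced)

def nearest_terms(term, k=3):
    """Candidate related thesaurus terms for `term`, ranked by
    cosine similarity in the reduced space."""
    i = vocab.index(term)
    sims = cosine_similarity(reduced[i : i + 1], reduced)[0]
    order = np.argsort(-sims)
    return [vocab[j] for j in order if j != i][:k]

print(nearest_terms("new_term"))
```

In the actual method, the candidates' positions relative to existing hierarchy levels (rather than raw similarity alone) would inform where in the thesaurus the new term is allocated.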
