A domain keyword analysis approach extending Term Frequency-Keyword Active Index with Google Word2Vec model

In bibliometric research, keyword analysis of publications is an effective way both to investigate the knowledge structure of research domains and to explore developing trends within them. Many approaches have been proposed to identify the most representative keywords. Most rely on statistical regularities, syntax, grammar, or network-based characteristics to select representative keywords for domain analysis. In this paper, we argue that domain knowledge is reflected by the semantic meanings behind keywords rather than by the keywords themselves. We apply the Google Word2Vec model, a neural network model of distributed word representations, to represent the semantic meanings of the keywords. Building on this representation, we propose a new domain keyword analysis approach, the Semantic Frequency-Semantic Active Index, which, like Term Frequency-Inverse Document Frequency, links domain and background information to identify infrequent but important keywords. Before any statistical computation, we measure semantic similarity, so that frequencies are computed for “semantic units” rather than for individual keywords. Semantic units are generated by clustering word vectors, and the Inverse Document Frequency is extended to a semantic inverse document frequency, in which only words in the inverse (background) documents that reach a certain similarity are counted. Taking geographical natural hazards as the domain and natural hazards as the background discipline, we identify the domain-specific knowledge that distinguishes geographical natural hazards from other types of natural hazards. We compare the proposed method with existing methods and discuss its advantages and disadvantages, finding that introducing the semantic meaning of the keywords supports more effective domain knowledge analysis.
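The abstract does not give the formulas, so the following is only a minimal sketch of the general idea under stated assumptions: toy 2-D vectors stand in for Word2Vec embeddings, a greedy cosine-threshold grouping stands in for the word vector clustering, and the semantic inverse document frequency uses an assumed `log(N / (1 + hits))` form. The function names, the threshold of 0.95, and all data values are hypothetical and not taken from the paper.

```python
import math
from collections import Counter

# Toy 2-D vectors standing in for Word2Vec embeddings (hypothetical values).
VECTORS = {
    "flood": (1.0, 0.1),
    "flooding": (0.95, 0.15),
    "inundation": (0.9, 0.2),
    "earthquake": (0.1, 1.0),
    "seismic": (0.15, 0.95),
    "gis": (-1.0, 0.2),
}


def cosine(u, v):
    """Cosine similarity between two vectors given as tuples."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def build_semantic_units(words, threshold=0.95):
    """Greedy grouping (an assumption, not the paper's clustering):
    a word joins the first unit whose seed word is similar enough,
    otherwise it starts a new unit. unit[0] is the seed."""
    units = []
    for w in words:
        for unit in units:
            if cosine(VECTORS[w], VECTORS[unit[0]]) >= threshold:
                unit.append(w)
                break
        else:
            units.append([w])
    return units


def semantic_frequency(keywords, units):
    """Frequency of each semantic unit: summed counts of its member keywords."""
    counts = Counter(keywords)
    return {unit[0]: sum(counts[w] for w in unit) for unit in units}


def semantic_idf(unit, background_docs, threshold=0.95):
    """Assumed form log(N / (1 + hits)): a background document is a hit
    if any of its words is similar enough to the unit's seed word."""
    seed = VECTORS[unit[0]]
    hits = sum(
        any(cosine(VECTORS[w], seed) >= threshold for w in doc)
        for doc in background_docs
    )
    return math.log(len(background_docs) / (1 + hits))


# Hypothetical domain keywords and background documents.
DOMAIN_KEYWORDS = ["flood", "flooding", "inundation", "gis", "earthquake"]
BACKGROUND_DOCS = [["earthquake", "seismic"], ["flood"], ["seismic"], ["earthquake"]]

units = build_semantic_units(DOMAIN_KEYWORDS)
frequencies = semantic_frequency(DOMAIN_KEYWORDS, units)
for unit in units:
    score = frequencies[unit[0]] * semantic_idf(unit, BACKGROUND_DOCS)
    print(unit, round(score, 3))
```

In this toy setup, "gis" occurs only once in the domain but has no similar word in the background, so it keeps a comparatively high score, while the "earthquake" unit is discounted because the background also contains the similar word "seismic". This mirrors the stated aim of surfacing infrequent but domain-specific keywords, though the actual weighting in the paper may differ.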
