Predicting Abstract Keywords by Word Vectors

The continued development of information technology has led to an explosive growth of textual information, and retrieving the required information from large-scale text quickly and accurately has become a major challenge. Keyword extraction is an effective way to address this problem and is one of the core technologies in text mining. However, most texts are published without keywords, and some of a text's keywords do not appear in its body at all; existing algorithms offer no satisfactory solution to this problem. This paper proposes a keyword extraction method based on word vectors. A word2vec model maps the concepts of a text into a vector space that a computer can operate on: all words and keywords that occur in the training text are embedded as vectors, and the words of a test text are then replaced by their word vectors. The Euclidean distance between every candidate keyword and every word of the text is computed, and the top-N closest candidates are returned as the automatically extracted keywords. The experiments use computer science papers as the training text. The results show that the method improves the accuracy of phrase keyword extraction and can find keywords that do not appear in the text. A sketch of the distance-based selection step follows the abstract.
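The following is a minimal sketch of the described pipeline, not the authors' implementation. It assumes gensim's Word2Vec, a pre-tokenized training corpus, a candidate keyword vocabulary supplied by the caller, and aggregation of the per-word Euclidean distances by their mean (the abstract does not specify how distances to the individual text words are combined, so the mean is an assumption).

```python
import numpy as np
from gensim.models import Word2Vec

# Train word vectors on a tokenized corpus (gensim >= 4.0 API).
# `corpus` is a list of token lists, e.g. one list per sentence or per paper.
corpus = [
    ["keyword", "extraction", "based", "on", "word", "vectors"],
    ["word2vec", "maps", "words", "into", "a", "vector", "space"],
    # ... more tokenized training text ...
]
model = Word2Vec(corpus, vector_size=100, window=5, min_count=1, sg=1)

def extract_keywords(doc_tokens, candidates, model, top_n=5):
    """Rank candidate keywords by their mean Euclidean distance to the
    word vectors of the document and return the top-N closest."""
    doc_vecs = np.array([model.wv[w] for w in doc_tokens if w in model.wv])
    scored = []
    for cand in candidates:
        if cand not in model.wv:
            continue
        # Euclidean distance from the candidate vector to every document word vector.
        dists = np.linalg.norm(doc_vecs - model.wv[cand], axis=1)
        scored.append((cand, dists.mean()))
    scored.sort(key=lambda pair: pair[1])  # smaller distance = closer concept
    return [cand for cand, _ in scored[:top_n]]

# Usage example with hypothetical tokens and candidates.
doc = ["word2vec", "keyword", "extraction"]
candidates = ["vectors", "word2vec", "space", "extraction"]
print(extract_keywords(doc, candidates, model, top_n=2))
```

Because candidates are scored against the document in vector space rather than matched against its surface text, a candidate keyword can be returned even when it never occurs in the document, which is the abstract-keyword case the method targets.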
