With the increasing use of natural language processing, building word vectors that carry richer semantic information has become a priority. A word vector represents the most basic unit of language, the word, and forms the foundation of neural natural language processing models, so the quality of word vectors directly affects the performance of downstream applications. In the continuous bag-of-words (CBOW) model, some words are limited by their frequency of occurrence and do not receive enough training; moreover, because of the minimum-frequency threshold, some low-frequency words are discarded by the model entirely. In this paper, we build clusters of similar words from a semantic dictionary and integrate them into the CBOW model with the help of a multi-classifier. We use the improved word vectors to complete a semantic similarity comparison task. Compared with the original word vectors built by CBOW, the proposed method achieves higher accuracy, showing that semantic information is integrated into the vectors and that the word vectors of low-frequency words are improved.
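The abstract does not spell out the multi-classifier design, so the following is only a minimal sketch of the general idea under stated assumptions: alongside the usual CBOW target-word softmax, an auxiliary classifier predicts the semantic cluster of the target word, so that low-frequency words receive gradient signal shared with their more frequent cluster-mates. The toy corpus, the cluster assignments, and all hyperparameters below are illustrative assumptions, not taken from the paper.

```python
# A minimal sketch (not the authors' implementation) of augmenting CBOW
# training with semantic-dictionary clusters: besides predicting the
# target word, an auxiliary softmax predicts the target's cluster, so
# rare words share gradient signal with frequent cluster-mates.
import numpy as np

rng = np.random.default_rng(0)

corpus = "the cat sat on the mat the dog sat on the rug".split()
vocab = sorted(set(corpus))
w2i = {w: i for i, w in enumerate(vocab)}
V, D, WINDOW, LR = len(vocab), 16, 2, 0.05

# Hypothetical clusters, e.g. derived from a semantic dictionary:
clusters = {"cat": 0, "dog": 0, "mat": 1, "rug": 1}  # animals, floor coverings
K = 2                                                # number of clusters

W_in = rng.normal(0, 0.1, (V, D))    # input (context) embeddings
W_out = rng.normal(0, 0.1, (D, V))   # output layer for word prediction
W_cls = rng.normal(0, 0.1, (D, K))   # auxiliary cluster classifier

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

for epoch in range(200):
    for t, target in enumerate(corpus):
        ctx = [w2i[corpus[j]]
               for j in range(max(0, t - WINDOW),
                              min(len(corpus), t + WINDOW + 1))
               if j != t]
        h = W_in[ctx].mean(axis=0)          # CBOW hidden vector

        # Standard CBOW objective: predict the target word.
        p_word = softmax(h @ W_out)
        g_word = p_word.copy()
        g_word[w2i[target]] -= 1.0          # d(cross-entropy)/d(logits)
        grad_h = W_out @ g_word
        W_out -= LR * np.outer(h, g_word)

        # Auxiliary objective: predict the target's semantic cluster,
        # when the dictionary covers it.
        if target in clusters:
            p_cls = softmax(h @ W_cls)
            g_cls = p_cls.copy()
            g_cls[clusters[target]] -= 1.0
            grad_h += W_cls @ g_cls
            W_cls -= LR * np.outer(h, g_cls)

        # Context embeddings receive gradients from both objectives.
        for c in ctx:
            W_in[c] -= LR * grad_h / len(ctx)
```

In this toy setup, "rug" and "mat" share a cluster, so the auxiliary loss pulls the contexts of the rarer word toward the same region of the vector space as its frequent cluster-mate, which is one plausible mechanism for the improvement on low-frequency words that the paper reports.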