Similar Cluster Based Continuous Bag-of-Words for Word Vector Training

With the growing use of natural language processing, building word vectors that carry more semantic information has become a priority. A word vector represents the most basic unit of language, the word, and is the foundation of neural natural language processing models, so the quality of word vectors directly affects the performance of downstream applications. In the continuous bag-of-words (CBOW) model, words that occur infrequently do not receive enough training, and words whose frequency falls below the minimum-frequency threshold are ignored by the model altogether. In this paper, we build similar clusters from a semantic dictionary and integrate them into the CBOW model with the help of a multi-classifier. We use the improved word vectors to complete a semantic similarity comparison task. Compared with the original word vectors built by CBOW, the proposed method achieves higher accuracy. This shows that some semantic information has been integrated and that the word vectors of low-frequency words are improved.