Improvement of Short Text Clustering Based on Weighted Word Embeddings

Data sparseness in short text clustering leads to poor clustering performance. One solution is to enrich short texts with semantically related terms drawn from an external text corpus. A more recent alternative is neural-network-based text representation learning, i.e., word embeddings. In this paper, we study methods for representing a short text as a single vector. Deriving such a vector from word embeddings is a challenging task. One way is to average the word vectors, but this ignores the importance of individual terms. TF-IDF weighted averaging is a better approach, yet term sparseness makes the locally computed IDF insufficient. We propose a new method based on TF-GIDF weighted vectors, which uses a global IDF to overcome this shortcoming. Experiments comparing the new method with baselines show that the proposed method significantly outperforms them.
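To illustrate the weighting schemes discussed above, the following is a minimal sketch of TF-weighted averaging of word embeddings. It is not the paper's implementation; the function and parameter names (weighted_embedding, embeddings, idf, default_idf) are illustrative assumptions. Passing an IDF table computed on the short-text collection itself corresponds to local TF-IDF weighting, while passing an IDF table estimated from a large external corpus corresponds to the global-IDF (TF-GIDF) idea described in the abstract.

```python
# Illustrative sketch only: weighted averaging of word vectors.
# embeddings: dict mapping term -> numpy vector (e.g., pre-trained word embeddings)
# idf:        dict mapping term -> IDF weight (local IDF or global IDF)
from collections import Counter
import numpy as np

def weighted_embedding(tokens, embeddings, idf, default_idf=1.0):
    """Represent a short text as the TF*IDF-weighted average of its word vectors."""
    counts = Counter(tokens)          # term frequencies within the short text
    vecs, weights = [], []
    for term, tf in counts.items():
        if term in embeddings:        # skip out-of-vocabulary terms
            vecs.append(embeddings[term])
            weights.append(tf * idf.get(term, default_idf))
    if not vecs:
        return None                   # no known terms in this text
    return np.average(np.asarray(vecs), axis=0, weights=np.asarray(weights))

# Uniform weights (idf = {} so every term falls back to default_idf) reduce this
# to plain vector averaging, the baseline that ignores term importance.
```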