Improvement of Short Text Clustering Based on Weighted Word Embeddings

Data sparseness in short text clustering leads to poor clustering performance. One solution is to enrich short texts with semantically related terms drawn from an external text corpus. A more recent alternative is neural-network-based text representation learning, i.e., word embeddings. In this paper, we study methods for representing a short text as a single vector. Deriving such a vector from word embeddings is a challenging task. One way is to average the word vectors, but this ignores the importance of individual terms. TF-IDF weighted averaging is a better approach, yet term sparseness makes the locally computed IDF insufficient. We propose a new method based on TF-GIDF weighted vectors, which uses a global IDF to overcome this shortcoming. Experiments comparing the new method with baselines show that the proposed method significantly outperforms them.
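To illustrate the weighting schemes discussed above, the following is a minimal sketch of TF-weighted averaging of word embeddings. It is not the paper's implementation; the function and parameter names (weighted_embedding, embeddings, idf, default_idf) are illustrative assumptions. Passing an IDF table computed on the short-text collection itself corresponds to local TF-IDF weighting, while passing an IDF table estimated from a large external corpus corresponds to the global-IDF (TF-GIDF) idea described in the abstract.

```python
# Illustrative sketch only: weighted averaging of word vectors.
# embeddings: dict mapping term -> numpy vector (e.g., pre-trained word embeddings)
# idf:        dict mapping term -> IDF weight (local IDF or global IDF)
from collections import Counter
import numpy as np

def weighted_embedding(tokens, embeddings, idf, default_idf=1.0):
    """Represent a short text as the TF*IDF-weighted average of its word vectors."""
    counts = Counter(tokens)          # term frequencies within the short text
    vecs, weights = [], []
    for term, tf in counts.items():
        if term in embeddings:        # skip out-of-vocabulary terms
            vecs.append(embeddings[term])
            weights.append(tf * idf.get(term, default_idf))
    if not vecs:
        return None                   # no known terms in this text
    return np.average(np.asarray(vecs), axis=0, weights=np.asarray(weights))

# Uniform weights (idf = {} so every term falls back to default_idf) reduce this
# to plain vector averaging, the baseline that ignores term importance.
```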