Instant Message Clustering Based on Extended Vector Space Model

Instant communication technologies such as instant messaging (IM) are now widely used. For such large-scale mass-communication media, clustering the text content of instant messages is a practical way to analyze the characteristics of that content and to discover or track hot social topics. However, a single instant message usually contains few keywords, and those keywords may even be latent; moreover, a single message cannot describe its conversational context. These properties make instant messages very different from general documents and make common clustering algorithms unsuitable. A novel method called WR-KMeans is proposed, which combines related instant messages into a conversation and enriches the conversation's vector with words that do not appear in the conversation but are closely related to words that do. WR-KMeans then performs k-means-style clustering on this extended vector space of conversations. Experiments on public datasets show that WR-KMeans outperforms the traditional k-means and bisecting k-means algorithms.
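The idea of an extended conversation vector can be illustrated with a minimal Python sketch. It is not the WR-KMeans implementation: the abstract does not say how word relatedness is measured, so the sketch assumes a Jaccard co-occurrence score over conversations, and the names `term_vector`, `cooccurrence_relatedness`, `extend_vector`, `threshold`, and `damping` are hypothetical. The clustering step is a plain k-means loop with cosine similarity and first-k initialization, chosen only to keep the example deterministic.

```python
from collections import Counter
from itertools import combinations
import math


def term_vector(conversation):
    """Bag-of-words counts for a conversation given as a list of messages."""
    return Counter(w for message in conversation for w in message.lower().split())


def cooccurrence_relatedness(vectors):
    """Assumed relatedness measure: Jaccard overlap of the conversations
    in which two words occur (not taken from the paper)."""
    doc_freq, pair_freq = Counter(), Counter()
    for vec in vectors:
        words = sorted(vec)
        doc_freq.update(words)
        pair_freq.update(combinations(words, 2))
    related = {}
    for (a, b), both in pair_freq.items():
        score = both / (doc_freq[a] + doc_freq[b] - both)
        related.setdefault(a, {})[b] = score
        related.setdefault(b, {})[a] = score
    return related


def extend_vector(vec, related, threshold=0.5, damping=0.5):
    """Add words absent from the conversation but related to its words.
    threshold and damping are hypothetical tuning parameters."""
    extended = dict(vec)
    for word, weight in vec.items():
        for other, score in related.get(word, {}).items():
            if other not in vec and score >= threshold:
                candidate = damping * score * weight
                extended[other] = max(extended.get(other, 0.0), candidate)
    return extended


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def kmeans(vectors, k, iterations=10):
    """Plain k-means with cosine similarity; the first k vectors seed the centroids."""
    centroids = [dict(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        labels = [max(range(k), key=lambda c: cosine(v, centroids[c]))
                  for v in vectors]
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:
                total = Counter()
                for member in members:
                    for term, weight in member.items():
                        total[term] += weight
                centroids[c] = {t: w / len(members) for t, w in total.items()}
    return labels


conversations = [
    ["football match tonight"],
    ["pizza pasta dinner"],
    ["football match goal"],
    ["pizza pasta recipe"],
]
vectors = [term_vector(c) for c in conversations]
related = cooccurrence_relatedness(vectors)
extended = [extend_vector(v, related) for v in vectors]
labels = kmeans(extended, k=2)
print(labels)
```

In this toy corpus the first conversation never mentions "goal", yet its extended vector gains a small weight for it because "goal" co-occurs with "football" and "match" elsewhere. That is the kind of enrichment the abstract describes, here under the assumed relatedness measure.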