Chinese Web Comments Clustering Analysis with a Two-phase Method

A meaningful web topic, especially a hot one, often attracts tens of thousands of comments. Grouping these comments into clusters and identifying the mainstream opinions is therefore valuable. Such analysis faces two difficulties. First, unlike web pages or blog comments, web comments have no explicit link relationships among them. Second, most comments are very short, some only one or two words long. As a result, traditional clustering algorithms such as CURE and DBSCAN do not work well when applied to these comments directly. In this paper we propose a two-phase algorithm that first combines highly synonymous comments into longer ones based on a connected graph model, and then applies improved clustering methods to the resulting collection. Experimental results on two real data sets show that our algorithm performs better than traditional algorithms such as CURE.
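The first phase can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not specify the similarity measure or the graph construction, so the sketch assumes character-bigram Jaccard similarity (which needs no Chinese word segmentation), a hypothetical `threshold` parameter, and merging by connected components. The function names are illustrative only.

```python
from itertools import combinations


def char_bigrams(text):
    """Return the set of character bigrams of text, ignoring whitespace."""
    text = "".join(text.split())
    if len(text) < 2:
        return {text} if text else set()
    return {text[i:i + 2] for i in range(len(text) - 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 if either set is empty."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def merge_similar_comments(comments, threshold=0.5):
    """Merge comments that fall in the same connected component.

    Assumption: two comments are linked by an edge when their bigram
    Jaccard similarity is at least `threshold`. Each connected component
    is concatenated into one longer pseudo-document.
    """
    parent = list(range(len(comments)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    grams = [char_bigrams(c) for c in comments]
    for i, j in combinations(range(len(comments)), 2):
        if jaccard(grams[i], grams[j]) >= threshold:
            ri, rj = find(i), find(j)
            if ri != rj:
                # Attach the larger root to the smaller one.
                parent[max(ri, rj)] = min(ri, rj)

    groups = {}
    for i, comment in enumerate(comments):
        groups.setdefault(find(i), []).append(comment)
    return [" ".join(group) for group in groups.values()]


if __name__ == "__main__":
    sample = ["支持这个决定", "支持这个决定!", "太贵了", "价格太贵了"]
    for doc in merge_similar_comments(sample):
        print(doc)
```

The merged pseudo-documents would then be passed to the second-phase clustering step. The pairwise comparison here is quadratic in the number of comments, so a collection of tens of thousands of comments would likely need blocking or an inverted index in practice.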
