Chinese Blog Clustering by Hidden Sentiment Factors

Blogs have become a major platform for people to express their opinions and sentiments on the Web. Traditional blog clustering methods usually group blogs by keywords, stories, or timelines, and do not consider the opinions and emotions expressed in the articles. This paper presents a novel method based on Probabilistic Latent Semantic Analysis (PLSA) to model hidden emotion factors, and proposes an emotion-oriented clustering approach that groups Chinese blogs by sentiment similarity. Extensive experiments on real-world blog datasets covering different topics show that the approach clusters Chinese blogs into sentiment-coherent groups, allowing better organization and easier navigation.
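
The abstract does not give implementation details, so the following is only a minimal sketch of the general idea: fit standard PLSA by EM on a blog-by-sentiment-word count matrix, then group blogs by their factor distributions P(z|d). The function name `plsa`, the toy four-word vocabulary, the counts, and the "dominant factor" grouping rule are all illustrative assumptions, not the authors' actual procedure or data.

```python
import numpy as np

def plsa(counts, n_factors, n_iter=100, seed=0):
    """Fit standard PLSA by EM; returns P(z|d) and P(w|z)."""
    rng = np.random.default_rng(seed)
    n_docs, n_words = counts.shape
    # Random, row-normalised initial guesses for P(z|d) and P(w|z).
    p_z_d = rng.random((n_docs, n_factors))
    p_z_d /= p_z_d.sum(axis=1, keepdims=True)
    p_w_z = rng.random((n_factors, n_words))
    p_w_z /= p_w_z.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        # E-step: P(z|d,w) is proportional to P(z|d) * P(w|z).
        # Array shape is (docs, factors, words).
        post = p_z_d[:, :, None] * p_w_z[None, :, :]
        post /= post.sum(axis=1, keepdims=True) + 1e-12
        # M-step: re-estimate both distributions from expected counts.
        weighted = counts[:, None, :] * post
        p_w_z = weighted.sum(axis=0)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True) + 1e-12
        p_z_d = weighted.sum(axis=2)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True) + 1e-12
    return p_z_d, p_w_z

# Hypothetical toy data: rows are blogs, columns are counts of the
# sentiment words ["happy", "love", "angry", "hate"].
counts = np.array([
    [5, 4, 0, 0],
    [4, 6, 0, 1],
    [0, 0, 5, 5],
    [1, 0, 6, 4],
], dtype=float)

p_z_d, p_w_z = plsa(counts, n_factors=2)

# Assumed grouping rule: assign each blog to its dominant hidden factor.
labels = p_z_d.argmax(axis=1)
print(labels)
```

Grouping by the dominant factor is the simplest possible choice; any clustering algorithm that operates on the rows of `p_z_d` (for example k-means) could be substituted without changing the PLSA step.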
