WisColl: Collective wisdom based blog clustering

The blogosphere is expanding at an unprecedented speed. A better understanding of the blogosphere can greatly facilitate the development of the Social Web to serve the needs of users, service providers, and advertisers. One important task in this process is clustering blog sites. Although a good number of traditional clustering methods exist, they are not designed to take into account the blogosphere's unique characteristics, so clustering blog sites presents new challenges. A prominent feature of the Social Web is that many enthusiastic bloggers voluntarily write, tag, and catalog their posts in order to reach the widest possible audience who will share their thoughts and appreciate their ideas. In the process, a new kind of collective wisdom is generated. We propose WisColl, which taps into this collective wisdom when clustering blog sites. In this paper, we study how clustering with collective wisdom can be achieved and compare it with a representative traditional clustering method. We present statistical and visual results, report findings, and suggest future work extending to many real-world applications.
