Efficient link-based clustering in a large scaled blog network

In this paper, we address the efficient processing of link-based clustering in a large-scale data environment. LinkClus is a link-based clustering method that provides good accuracy and reasonable performance. We first show that this method is not sufficiently scalable to be applied to a huge volume of real-world blog data. We then observe that the performance bottleneck of LinkClus lies in its initial clustering step, and we propose a new method to overcome this bottleneck. The proposed method first efficiently identifies the seed sets for initial clustering. Here, each seed set consists of a small number (two or three) of objects that are highly similar to one another. The method then adds every other object to the seed set that is most similar to that object. It also eliminates objects with very few links, which negatively affect accuracy, thereby enhancing overall processing performance. Through experiments with real-world blog data, we verify the scalability and accuracy of the proposed method.
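The three steps described above (pruning objects with very few links, forming small seed sets of highly similar objects, and attaching every remaining object to its most similar seed set) can be illustrated with a minimal sketch. It is not the paper's implementation. The abstract does not name a similarity measure or a seed-finding procedure, so Jaccard similarity over outgoing links, the brute-force all-pairs search, the restriction to seed pairs, and the names `jaccard`, `initial_clusters`, `min_links`, and `seed_threshold` are all assumptions made for illustration. In particular, the all-pairs search is quadratic and does not reflect the efficient seed identification the paper proposes.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two link sets (an assumed stand-in measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def average_similarity(obj, members, links):
    """Mean similarity between one object and the members of a seed set."""
    return sum(jaccard(links[obj], links[m]) for m in members) / len(members)


def initial_clusters(links, min_links=2, seed_threshold=0.8):
    """Hypothetical sketch of seed-based initial clustering.

    links maps each object to the set of objects it links to.
    """
    # Step 1: drop objects with very few links.
    kept = {obj: out for obj, out in links.items() if len(out) >= min_links}
    names = sorted(kept)

    # Step 2: greedily pick disjoint, highly similar pairs as seed sets.
    # (Brute force over all pairs; an assumption, not the paper's method.)
    scored = sorted(
        ((jaccard(kept[a], kept[b]), a, b) for a, b in combinations(names, 2)),
        key=lambda item: (-item[0], item[1], item[2]),
    )
    seeds, seeded = [], set()
    for sim, a, b in scored:
        if sim < seed_threshold:
            break
        if a not in seeded and b not in seeded:
            seeds.append((a, b))
            seeded.update((a, b))

    clusters = [list(seed) for seed in seeds]
    if not seeds:
        return clusters

    # Step 3: attach every other object to its most similar seed set.
    for obj in names:
        if obj in seeded:
            continue
        best = max(
            range(len(seeds)),
            key=lambda i: average_similarity(obj, seeds[i], kept),
        )
        clusters[best].append(obj)
    return clusters


# Toy data: two groups of blogs plus one blog with a single link.
example_links = {
    "a": {"x", "y", "z"},
    "b": {"x", "y", "z"},
    "c": {"x", "y"},
    "d": {"p", "q", "r"},
    "e": {"p", "q", "r"},
    "f": {"p", "q"},
    "g": {"x"},
}
print(initial_clusters(example_links))
```

On the toy data, `g` is pruned for having a single link, `(a, b)` and `(d, e)` become seed sets, and `c` and `f` join the seed sets they overlap with. Comparing each object against a handful of small seed sets, rather than against every other object, is what keeps the assignment step cheap.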
