Scalable Link-Based Similarity Computation and Clustering

Data objects in a relational database are cross-linked with each other via multi-typed links. Links contain rich semantic information that may indicate important relationships among objects, such as the similarities between objects. In this chapter we explore linkage-based clustering, in which the similarity between two objects is measured based on the similarities between the objects linked with them. We study a hierarchical structure called SimTree, which represents similarities in a multi-granularity manner. This method avoids the high cost of computing and storing pairwise similarities but still thoroughly explores relationships among objects. We introduce an efficient algorithm for computing similarities utilizing the SimTree.
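To make the recursive definition concrete, the following is a minimal sketch of a naive linkage-based similarity computation, in the style of SimRank-like iteration. It is not the SimTree algorithm from this chapter; the function name `linkage_similarity`, the `decay` and `iterations` parameters, and the toy author/paper graph are all assumptions made for illustration. The sketch stores a score for every pair of objects, which is exactly the cost that a hierarchical structure such as SimTree is meant to avoid.

```python
def linkage_similarity(neighbors, decay=0.8, iterations=5):
    """Naive pairwise linkage-based similarity (illustrative only).

    neighbors: dict mapping each object to the list of objects linked with it.
    The similarity of two distinct objects is the decayed average similarity
    of their linked objects; an object is fully similar to itself.
    """
    nodes = list(neighbors)
    # Start with identity similarity: 1 on the diagonal, 0 elsewhere.
    sim = {(a, b): 1.0 if a == b else 0.0 for a in nodes for b in nodes}
    for _ in range(iterations):
        new_sim = {}
        for a in nodes:
            for b in nodes:
                if a == b:
                    new_sim[(a, b)] = 1.0
                    continue
                na, nb = neighbors[a], neighbors[b]
                if not na or not nb:
                    # Objects without links have no evidence of similarity.
                    new_sim[(a, b)] = 0.0
                    continue
                total = sum(sim[(x, y)] for x in na for y in nb)
                new_sim[(a, b)] = decay * total / (len(na) * len(nb))
        sim = new_sim
    return sim


# Hypothetical toy data: authors a1 and a2 share paper p1; a3 wrote only p2.
neighbors = {
    "a1": ["p1"],
    "a2": ["p1"],
    "a3": ["p2"],
    "p1": ["a1", "a2"],
    "p2": ["a3"],
}
sim = linkage_similarity(neighbors)
```

In this toy graph, `a1` and `a2` come out as similar because they link to the same paper, while `a1` and `a3` share no linked objects and stay at zero. The table `sim` grows quadratically with the number of objects, which illustrates why storing all pairwise similarities becomes expensive on large databases.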
