K-Means Based Approaches to Clustering Nodes in Annotated Graphs

The goal of clustering is to form groups of similar elements. Quality criteria for clusterings, as well as the notion of similarity, depend strongly on the application domain, which explains the existence of many different clustering algorithms and similarity measures. In this paper we focus on the problem of clustering annotated nodes in a graph, when the similarity between nodes depends on both their annotations and their context in the graph ("hybrid" similarity), using k-means-like clustering algorithms. We show that, for the similarity measure we focus on, k-means itself cannot trivially be applied. We propose three alternatives, and evaluate them empirically on the Cora dataset. We find that using these alternative clustering algorithms with the hybrid similarity can be advantageous over using standard k-means with a purely annotation-based similarity.
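
To make the setting concrete, the following is a minimal sketch of one k-means-like algorithm driven by a hybrid similarity. It rests on several assumptions that are not taken from this abstract. The function `hybrid_similarity` is a hypothetical measure that mixes the Jaccard similarity of two nodes' annotations with the average annotation similarity between each node and the other node's neighbors; the weight `alpha` and the toy graph are also invented for illustration. The clustering step is plain k-medoids, which keeps every cluster representative an actual node and so needs only pairwise similarities. A plausible reason standard k-means is awkward here is that a mean of several nodes is not itself a node and therefore has no graph context. The abstract does not say whether k-medoids is one of the three proposed alternatives.

```python
from typing import Callable, Dict, List, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity between two annotation sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def hybrid_similarity(u: str, v: str,
                      annotations: Dict[str, Set[str]],
                      neighbors: Dict[str, List[str]],
                      alpha: float = 0.5) -> float:
    """Hypothetical hybrid measure (an assumption, not the paper's definition):
    alpha * annotation similarity + (1 - alpha) * neighborhood similarity."""
    content = jaccard(annotations[u], annotations[v])
    # Compare each node with the neighbors of the other node.
    cross = [jaccard(annotations[u], annotations[w]) for w in neighbors[v]]
    cross += [jaccard(annotations[v], annotations[w]) for w in neighbors[u]]
    context = sum(cross) / len(cross) if cross else content
    return alpha * content + (1 - alpha) * context


def k_medoids(nodes: List[str],
              sim: Callable[[str, str], float],
              k: int,
              max_iter: int = 100) -> Dict[str, List[str]]:
    """Alternate assignment and medoid update using only pairwise similarities."""
    medoids = list(nodes[:k])  # deterministic initialization, for illustration only
    clusters: Dict[str, List[str]] = {}
    for _ in range(max_iter):
        # Each medoid always stays in its own cluster, so no cluster is empty.
        clusters = {m: [m] for m in medoids}
        for n in nodes:
            if n in clusters:
                continue
            clusters[max(medoids, key=lambda m: sim(n, m))].append(n)
        # The new medoid is the member most similar, in total, to its cluster.
        new_medoids = [
            max(members, key=lambda c: sum(sim(c, x) for x in members))
            for members in clusters.values()
        ]
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return clusters


# Toy annotated graph: two triangles with disjoint vocabularies (invented data).
annotations = {
    "a": {"graph", "cluster"}, "b": {"graph", "cluster", "node"}, "c": {"graph", "node"},
    "d": {"protein", "gene"}, "e": {"protein", "gene", "cell"}, "f": {"gene", "cell"},
}
neighbors = {
    "a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b"],
    "d": ["e", "f"], "e": ["d", "f"], "f": ["d", "e"],
}


def sim(u: str, v: str) -> float:
    return hybrid_similarity(u, v, annotations, neighbors)


result = k_medoids(["a", "d", "b", "c", "e", "f"], sim, k=2)
for medoid, members in result.items():
    print(medoid, sorted(members))
```

Setting `alpha` to 1.0 reduces the hypothetical measure to a purely annotation-based similarity, which gives a simple way to compare the two settings on the same data.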
