Semi-supervised Clustering of Graph Objects: A Subgraph Mining Approach

Semi-supervised clustering, which aims to improve clustering performance with limited supervision, has recently received considerable attention in the literature. Most existing semi-supervised clustering studies assume that the data are represented in a vector space, e.g., text and relational data. When the data objects have complex structures, e.g., proteins and chemical compounds, these methods are not directly applicable. In this paper, we study the problem of semi-supervised clustering of data objects that are represented as graphs. The supervision information takes the form of pairwise constraints: must-links and cannot-links. As there is no predefined feature set for the graph objects, we propose to use discriminative subgraph patterns as the features. We design an objective function that incorporates the constraints to guide the subgraph feature mining and selection process. We derive an upper bound of the objective function, based on which we propose a branch-and-bound algorithm to speed up subgraph mining. We also introduce a redundancy measure into the feature selection process to reduce redundancy in the feature set. Once the graph objects are represented in the vector space of the discriminative subgraph features, we use semi-supervised kernel K-means to cluster all graph objects. Experimental results on real-world protein datasets demonstrate that the constraint information can effectively guide the feature selection and clustering process and achieve satisfactory clustering performance.
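
As a rough illustration of how pairwise constraints could guide subgraph feature selection, the following minimal sketch scores binary subgraph-occurrence features against must-links and cannot-links and greedily skips redundant ones. The abstract does not give the paper's objective function, upper bound, or redundancy measure, so everything here is an assumption: `constraint_score`, `jaccard`, `select_features`, and the `max_overlap` threshold are hypothetical names, the score is a simple satisfied-minus-violated count, and redundancy is approximated by Jaccard overlap of the graphs in which two subgraphs occur.

```python
from typing import List, Sequence, Tuple

Pair = Tuple[int, int]  # indices of two graph objects


def constraint_score(feature: Sequence[int],
                     must_links: List[Pair],
                     cannot_links: List[Pair]) -> int:
    """Hypothetical score: satisfied constraints minus violated ones.

    feature[i] is 1 if the subgraph occurs in graph i, else 0.
    A must-link is satisfied when both graphs agree on the feature;
    a cannot-link is satisfied when they disagree.
    """
    score = 0
    for i, j in must_links:
        score += 1 if feature[i] == feature[j] else -1
    for i, j in cannot_links:
        score += 1 if feature[i] != feature[j] else -1
    return score


def jaccard(a: Sequence[int], b: Sequence[int]) -> float:
    """Overlap of the sets of graphs that contain each subgraph."""
    support_a = {i for i, v in enumerate(a) if v}
    support_b = {i for i, v in enumerate(b) if v}
    union = support_a | support_b
    # Two features with empty support are treated as fully redundant.
    return len(support_a & support_b) / len(union) if union else 1.0


def select_features(features: List[Sequence[int]],
                    must_links: List[Pair],
                    cannot_links: List[Pair],
                    k: int,
                    max_overlap: float = 0.8) -> List[int]:
    """Greedily pick up to k high-scoring, mutually non-redundant features."""
    ranked = sorted(
        range(len(features)),
        key=lambda f: constraint_score(features[f], must_links, cannot_links),
        reverse=True,
    )
    chosen: List[int] = []
    for f in ranked:
        # Keep f only if it is not too similar to any feature chosen so far.
        if all(jaccard(features[f], features[g]) <= max_overlap for g in chosen):
            chosen.append(f)
        if len(chosen) == k:
            break
    return chosen


# Toy data (made up): three candidate subgraphs over four graphs.
features = [
    [1, 1, 0, 0],  # agrees with the must-link, separates the cannot-link
    [1, 1, 0, 0],  # exact duplicate of the first feature
    [1, 0, 1, 0],  # violates both constraints
]
must_links = [(0, 1)]
cannot_links = [(0, 2)]
selected = select_features(features, must_links, cannot_links, k=2)
```

In this toy run the duplicate feature is skipped as redundant, so the selection falls back to a lower-scoring but distinct feature. A branch-and-bound miner of the kind the abstract describes would additionally prune the subgraph search space using an upper bound on the objective, which this sketch does not attempt.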
