Combining Structured Node Content and Topology Information for Networked Graph Clustering

Graphs are popularly used to represent objects with shared dependency relationships. To date, all existing graph clustering algorithms consider each node as a single attribute or a set of independent attributes, without realizing that content inside each node may also have complex structures. In this article, we formulate a new networked graph clustering task where a network contains a set of inter-connected (or networked) super-nodes, each of which is a single-attribute graph. The new super-node representation is applicable to many real-world applications, such as a citation network where each node denotes a paper whose content can be described as a graph, and citation relationships between papers form a networked graph (i.e., a super-graph). Networked graph clustering aims to find similar node groups, each of which contains nodes with similar content and structure information. The main challenge is to properly calculate the similarity between super-nodes for clustering. To solve the problem, we propose to characterize node similarity by integrating structure and content information of each super-node. To measure node content similarity, we use cosine distance by considering overlapped attributes between two super-nodes. To measure structure similarity, we propose an Attributed Random Walk Kernel (ARWK) to calculate the similarity between super-nodes. Detailed node content analysis is also included to build relationships between super-nodes with shared internal structure information, so the structure similarity can be calculated in a precise way. By integrating the structure similarity and content similarity as one matrix, the spectral clustering is used to achieve networked graph clustering. Our method enjoys sound theoretical properties, including bounded similarities and better structure similarity assessment than traditional graph clustering methods. Experiments on real-world applications demonstrate that our method significantly outperforms baseline approaches.

[1]  Frank Thomson Leighton,et al.  Graph Bisection Algorithms with Good Average Case Behavior , 1984, FOCS.

[2]  Leonid Zhukov,et al.  Clustering of bipartite advertiser-keyword graph , 2003 .

[3]  M E J Newman,et al.  Community structure in social and biological networks , 2001, Proceedings of the National Academy of Sciences of the United States of America.

[4]  Gerhard Weikum,et al.  Graph-based text classification: learn from your neighbors , 2006, SIGIR.

[5]  Jun Huan,et al.  GPD: A Graph Pattern Diffusion Kernel for Accurate Graph Classification with Applications in Cheminformatics , 2010, IEEE/ACM Transactions on Computational Biology and Bioinformatics.

[6]  Jitendra Malik,et al.  Normalized cuts and image segmentation , 1997, Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[7]  Santo Fortunato,et al.  Community detection in graphs , 2009, ArXiv.

[8]  Fillia Makedon,et al.  Fast Nonnegative Matrix Tri-Factorization for Large-Scale Data Co-Clustering , 2011, IJCAI.

[9]  Ron Shamir,et al.  Clustering Gene Expression Patterns , 1999, J. Comput. Biol..

[10]  Inderjit S. Dhillon,et al.  A fast kernel-based multilevel algorithm for graph clustering , 2005, KDD '05.

[11]  Jimeng Sun,et al.  Fast Random Walk Graph Kernel , 2012, SDM.

[12]  M E J Newman,et al.  Finding and evaluating community structure in networks. , 2003, Physical review. E, Statistical, nonlinear, and soft matter physics.

[13]  M. A. Muñoz,et al.  Journal of Statistical Mechanics: An IOP and SISSA journal Theory and Experiment Detecting network communities: a new systematic and efficient algorithm , 2004 .

[14]  Sebastian Nowozin,et al.  gBoost: a mathematical programming approach to graph classification and regression , 2009, Machine Learning.

[15]  George Karypis,et al.  Frequent subgraph discovery , 2001, Proceedings 2001 IEEE International Conference on Data Mining.

[16]  J. MacQueen Some methods for classification and analysis of multivariate observations , 1967 .

[17]  Joydeep Ghosh,et al.  Cluster Ensembles --- A Knowledge Reuse Framework for Combining Multiple Partitions , 2002, J. Mach. Learn. Res..

[18]  Thomas Gärtner,et al.  On Graph Kernels: Hardness Results and Efficient Alternatives , 2003, COLT.

[19]  Leon Danon,et al.  Comparing community structure identification , 2005, cond-mat/0505245.

[20]  Andras Fiser,et al.  Functional clustering of immunoglobulin superfamily proteins with protein-protein interaction information calibrated hidden Markov model sequence profiles. , 2014, Journal of molecular biology.

[21]  Satu Elisa Schaeffer,et al.  Graph Clustering , 2017, Encyclopedia of Machine Learning and Data Mining.

[22]  Hong Cheng,et al.  Clustering Large Attributed Graphs: A Balance between Structural and Attribute Similarities , 2011, TKDD.

[23]  Jiawei Han,et al.  An attribute-oriented approach for learning classification rules from relational databases , 1990, [1990] Proceedings. Sixth International Conference on Data Engineering.

[24]  Mark Stamp,et al.  Deriving common malware behavior through graph clustering , 2013, Comput. Secur..

[25]  Thomas S. Huang,et al.  Graph Regularized Nonnegative Matrix Factorization for Data Representation. , 2011, IEEE transactions on pattern analysis and machine intelligence.

[26]  Hong Cheng,et al.  Graph Clustering Based on Structural/Attribute Similarities , 2009, Proc. VLDB Endow..

[27]  Christopher M. Bishop,et al.  Pattern Recognition and Machine Learning (Information Science and Statistics) , 2006 .

[28]  David D. Jensen,et al.  Graph clustering with network structure indices , 2007, ICML '07.

[29]  Sun Ho Lee,et al.  Chromosomal Microarray Testing in 42 Korean Patients with Unexplained Developmental Delay, Intellectual Disability, Autism Spectrum Disorders, and Multiple Congenital Anomalies , 2017, Genomics & informatics.

[30]  Xiaowei Xu,et al.  SCAN: a structural clustering algorithm for networks , 2007, KDD '07.

[31]  Charu C. Aggarwal,et al.  Graph Clustering , 2010, Encyclopedia of Machine Learning and Data Mining.

[32]  Philip S. Yu,et al.  Cross-relational clustering with user's guidance , 2005, KDD '05.

[33]  Michalis Vazirgiannis,et al.  Clustering and Community Detection in Directed Networks: A Survey , 2013, ArXiv.

[34]  Ulrike von Luxburg,et al.  A tutorial on spectral clustering , 2007, Stat. Comput..

[35]  Tu Bao Ho,et al.  A novel graph-based similarity measure for 2D chemical structures. , 2004, Genome informatics. International Conference on Genome Informatics.

[36]  Jignesh M. Patel,et al.  Efficient aggregation for graph summarization , 2008, SIGMOD Conference.

[37]  Jure Leskovec,et al.  Community Detection in Networks with Node Attributes , 2013, 2013 IEEE 13th International Conference on Data Mining.

[38]  Robert E. Tarjan,et al.  Graph Clustering and Minimum Cut Trees , 2004, Internet Math..

[39]  Stefan Kramer,et al.  A structural cluster kernel for learning on graphs , 2012, KDD.

[40]  Hong Cheng,et al.  GBAGC: A General Bayesian Framework for Attributed Graph Clustering , 2014, TKDD.

[41]  Massimo Marchiori,et al.  Method to find community structures based on information centrality. , 2004, Physical review. E, Statistical, nonlinear, and soft matter physics.

[42]  Ulrik Brandes,et al.  Experiments on Graph Clustering Algorithms , 2003, ESA.

[43]  Radford M. Neal Pattern Recognition and Machine Learning , 2007, Technometrics.

[44]  Fabrizio Costa,et al.  Fast Neighborhood Subgraph Pairwise Distance Kernel , 2010, ICML.

[45]  Athena Vakali,et al.  Capturing Social Data Evolution Using Graph Clustering , 2013, IEEE Internet Computing.

[46]  Feng Zhang,et al.  Expression profiling based on graph-clustering approach to determine colon cancer pathway. , 2013, Journal of cancer research and therapeutics.

[47]  Jiawei Han,et al.  ACM Transactions on Knowledge Discovery from Data: Introduction , 2007 .

[48]  Jiawei Han,et al.  Non-negative Matrix Factorization on Manifold , 2008, 2008 Eighth IEEE International Conference on Data Mining.

[49]  Xinlei Chen,et al.  Large Scale Spectral Clustering with Landmark-Based Representation , 2011, AAAI.

[50]  Philip S. Yu,et al.  On Clustering Graph Streams , 2010, SDM.

[51]  Bin Li,et al.  Fast Graph Stream Classification Using Discriminative Clique Hashing , 2013, PAKDD.

[52]  Ting Guo,et al.  Understanding the roles of sub-graph features for graph classification: an empirical study perspective , 2013, CIKM.

[53]  Alexandros Stamatakis,et al.  Parallel Structural Graph Clustering , 2011, ECML/PKDD.

[54]  Kaspar Riesen,et al.  Graph Classification and Clustering Based on Vector Space Embedding , 2010, Series in Machine Perception and Artificial Intelligence.

[55]  Xue Chen,et al.  Building Association Link Network for Semantic Link on Web Resources , 2011, IEEE Transactions on Automation Science and Engineering.

[56]  Jacobo Torán On the Hardness of Graph Isomorphism , 2004, SIAM J. Comput..

[57]  H. Sebastian Seung,et al.  Algorithms for Non-negative Matrix Factorization , 2000, NIPS.

[58]  Chris H. Q. Ding,et al.  Convex and Semi-Nonnegative Matrix Factorizations , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[59]  Jacobo Torán,et al.  On the hardness of graph isomorphism , 2000, Proceedings 41st Annual Symposium on Foundations of Computer Science.

[60]  Michael I. Jordan,et al.  On Spectral Clustering: Analysis and an algorithm , 2001, NIPS.