GraphClust: A Method for Clustering Database of Graphs

Any application that represents data as sets of graphs may benefit from the discovery of relationships among those graphs. Doing this in an unsupervised fashion requires the ability to find graphs that are similar to one another. That is the purpose of GraphClust. The GraphClust algorithm proceeds in three phases, often building on other tools: (1) it finds highly connected substructures in each graph; (2) it uses those substructures to represent each graph as a feature vector; and (3) it clusters these feature vectors using a standard distance measure. We validate cluster quality using the Silhouette method. In addition to clustering graphs, GraphClust uses singular value decomposition (SVD) to find frequently co-occurring connected substructures. The main novelty of GraphClust compared to previous methods is that it is application-independent and scales to many large graphs.
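As a rough illustration of phases (2) and (3) and of the Silhouette validation, the following Python sketch may help. It is not the GraphClust implementation: it assumes phase (1) has already produced, for each graph, a count of each mined substructure (the names "ring", "chain", and "star" are hypothetical), it assumes Euclidean distance as the standard distance measure, and it substitutes a simple farthest-first assignment for whatever clustering algorithm is actually used. The SVD step is not covered.

```python
from math import sqrt

def feature_vectors(graph_substructures):
    """Phase 2 (sketch): map each graph's substructure counts onto a shared vocabulary."""
    vocab = sorted({s for counts in graph_substructures for s in counts})
    vectors = [[float(counts.get(s, 0)) for s in vocab] for counts in graph_substructures]
    return vocab, vectors

def euclidean(a, b):
    """Euclidean distance, assumed here as the 'standard distance measure'."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def farthest_first_clusters(vectors, k):
    """Phase 3 (stand-in): pick k centers by farthest-first traversal, then
    assign every vector to its nearest center. Not the paper's algorithm."""
    centers = [0]
    while len(centers) < k:
        nxt = max(range(len(vectors)),
                  key=lambda i: min(euclidean(vectors[i], vectors[c]) for c in centers))
        centers.append(nxt)
    return [min(range(k), key=lambda j: euclidean(v, vectors[centers[j]]))
            for v in vectors]

def silhouette(vectors, labels):
    """Mean silhouette coefficient over all points; values lie in [-1, 1]."""
    scores = []
    for i, v in enumerate(vectors):
        by_cluster = {}
        for j, w in enumerate(vectors):
            if j != i:
                by_cluster.setdefault(labels[j], []).append(euclidean(v, w))
        own = by_cluster.pop(labels[i], None)
        if not own or not by_cluster:
            scores.append(0.0)  # singleton cluster or only one cluster overall
            continue
        a = sum(own) / len(own)                                   # mean intra-cluster distance
        b = min(sum(d) / len(d) for d in by_cluster.values())     # nearest other cluster
        scores.append(0.0 if max(a, b) == 0 else (b - a) / max(a, b))
    return sum(scores) / len(scores)

# Hypothetical output of phase 1: substructure counts for four graphs.
graphs = [
    {"ring": 2, "chain": 1},
    {"ring": 2},
    {"star": 3},
    {"star": 3, "chain": 1},
]

vocab, vectors = feature_vectors(graphs)
labels = farthest_first_clusters(vectors, k=2)
score = silhouette(vectors, labels)
print(vocab, labels, round(score, 3))
```

In this toy input the two ring-dominated graphs and the two star-dominated graphs end up in separate clusters, and the mean silhouette is well above zero, which is the kind of signal the Silhouette method is used to detect.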
