Parallel Structural Graph Clustering

We address the problem of clustering large graph databases according to scaffolds (i.e., large structural overlaps) that are shared between cluster members. In previous work, an online algorithm was proposed for this task that produces overlapping (non-disjoint) and non-exhaustive clusterings. In this paper, we parallelize this algorithm to take advantage of high-performance parallel hardware and further improve it in three ways: (i) a refined cluster membership test based on a set abstraction of graphs; (ii) sorting graphs by size, so that some cluster membership tests can be skipped entirely; and (iii) the definition of a cluster representative once the cluster scaffold is unique, which avoids comparisons with all cluster members. In experiments on a large database of chemical structures, we show that running times can be reduced by a large factor for one parameter setting used in previous work. For harder parameter settings, results could be obtained within reasonable time for 300,000 structures, compared to 10,000 structures in previous work. This shows that structural, scaffold-based clustering of smaller libraries for virtual screening is already feasible.
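The first two improvements can be illustrated with a minimal sketch. It is not the paper's implementation, and everything in it is an assumption made for illustration: graphs are abstracted to multisets of edge labels, the threshold `theta` is taken relative to the larger graph's edge count, and the function names (`edge_multiset`, `may_share_scaffold`, `candidate_pairs`) are hypothetical. The idea shown is that a cheap set-based test can rule out pairs before an expensive common-subgraph search, and that sorting by size lets whole runs of comparisons be skipped.

```python
from collections import Counter


def edge_multiset(edge_labels):
    """Abstract a graph to a multiset of edge labels (assumed abstraction)."""
    return Counter(edge_labels)


def may_share_scaffold(a, b, theta):
    """Necessary, not sufficient, condition for a large common subgraph.

    The multiset intersection is an upper bound on the number of edges
    any common subgraph of the two graphs can have.
    """
    overlap = sum((a & b).values())
    needed = theta * max(sum(a.values()), sum(b.values()))
    return overlap >= needed


def candidate_pairs(graphs, theta):
    """Return index pairs that survive both the size and the set filter."""
    # Sort indices by graph size (number of edges), smallest first.
    order = sorted(range(len(graphs)), key=lambda i: sum(graphs[i].values()))
    pairs = []
    for pos, i in enumerate(order):
        size_i = sum(graphs[i].values())
        for j in order[pos + 1:]:
            size_j = sum(graphs[j].values())
            if size_i < theta * size_j:
                # Every later graph is at least as large, so none can match.
                break
            if may_share_scaffold(graphs[i], graphs[j], theta):
                pairs.append((i, j))
    return pairs
```

Only the pairs returned by `candidate_pairs` would then be passed to an exact (and much more expensive) common-subgraph test. The filter can produce false positives but, under the stated assumptions, no false negatives.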
