Blocked Inverted Indices for Exact Clustering of Large Chemical Spaces

The calculation of pairwise compound similarities based on fingerprints is one of the fundamental tasks in chemoinformatics. Efficient methods for this calculation are essential for applications such as similarity searching and library clustering. As public compound databases grow, exact clustering of these databases is desirable but often prohibitively expensive computationally. We present an optimized inverted index algorithm for calculating all pairwise similarities on 2D fingerprints of a given data set. In contrast to other algorithms, it neither requires GPU computing nor yields a stochastic approximation of the clustering. The algorithm is designed to work well with multicore architectures and shows excellent parallel speedup. As an application example, we implemented a deterministic clustering application designed to decompose virtual libraries comprising tens of millions of compounds in a short time on current hardware. Our results show that our implementation achieves more than 400 million Tanimoto similarity calculations per second on a common desktop CPU. Deterministic clustering of the available chemical space can thus be done on modern multicore machines within a few days.
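To illustrate the general idea of computing all pairwise Tanimoto similarities through an inverted index, the following is a minimal single-threaded sketch. It is not the blocked, multicore implementation described above. It assumes each fingerprint is given as a set of integer feature IDs, and the function names `build_inverted_index` and `pairwise_tanimoto` are hypothetical, chosen only for this example.

```python
from collections import defaultdict


def build_inverted_index(fingerprints):
    """Map each feature ID to the list of compound indices that contain it."""
    index = defaultdict(list)
    for cid, fp in enumerate(fingerprints):
        for feature in fp:
            index[feature].append(cid)
    return index


def pairwise_tanimoto(fingerprints, threshold=0.0):
    """Return {(i, j): similarity} for all i < j sharing at least one feature.

    Pairs with no common feature have Tanimoto similarity 0 and are never
    visited, which is where the inverted index saves work on sparse data.
    """
    index = build_inverted_index(fingerprints)
    sizes = [len(fp) for fp in fingerprints]
    result = {}
    for i, fp in enumerate(fingerprints):
        # Count features shared between compound i and each later compound j.
        shared = defaultdict(int)
        for feature in fp:
            for j in index[feature]:
                if j > i:
                    shared[j] += 1
        for j, common in shared.items():
            # Tanimoto = |A ∩ B| / (|A| + |B| - |A ∩ B|)
            sim = common / (sizes[i] + sizes[j] - common)
            if sim >= threshold:
                result[(i, j)] = sim
    return result


if __name__ == "__main__":
    # Toy fingerprints; the feature IDs are made up for illustration.
    fps = [{1, 2, 3}, {2, 3, 4}, {5, 6}, {1, 2, 3}]
    for pair, sim in sorted(pairwise_tanimoto(fps).items()):
        print(pair, sim)
```

Because the loop over `i` only reads the shared index and writes results for its own row, the rows could in principle be distributed across cores; how the work is actually partitioned into blocks in the presented algorithm is not specified here.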
