Accelerated similarity searching and clustering of large compound sets by geometric embedding and locality sensitive hashing

Motivation: Similarity searching and clustering of chemical compounds by structural similarity are important computational approaches for identifying drug-like small molecules. Most algorithms available for these tasks are limited in speed and scalability, and cannot handle today's large compound databases with several million entries.

Results: In this article, we introduce a new algorithm for accelerated similarity searching and clustering of very large compound sets using embedding and indexing (EI) techniques. First, we present EI-Search as a general-purpose similarity search method for finding objects with similar features in large databases, and apply it here to searching and clustering of large compound sets. The method embeds the compounds in a high-dimensional Euclidean space and searches this space using an efficient index-aware nearest neighbor search method based on locality sensitive hashing (LSH). Second, to cluster large compound sets, we introduce the EI-Clustering algorithm, which combines the EI-Search method with Jarvis–Patrick clustering. Both methods were tested on three large datasets with sizes ranging from about 260 000 to over 19 million compounds. In comparison to sequential search methods, the EI-Search method was 40–200 times faster while maintaining comparable recall rates. The EI-Clustering method reduced the CPU time required to cluster these large compound libraries from several months to only a few days.

Availability: Software implementations and online services have been developed based on the methods introduced in this study. The online services provide access to the generated clustering results and ultra-fast similarity searching of the PubChem Compound database with subsecond response time.

Contact: thomas.girke@ucr.edu

Supplementary information: Supplementary data are available at Bioinformatics online.
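The two ideas named in the abstract, LSH-based nearest neighbor search in a Euclidean space and Jarvis–Patrick clustering on the resulting neighbor lists, can be illustrated with a minimal Python sketch. This is not the authors' implementation. It assumes the compounds have already been embedded as coordinate tuples, uses a generic p-stable (Gaussian projection) hash family, and ranks bucket candidates by Euclidean distance in the embedded space. All function names and parameters (`build_index`, `knn_query`, `jarvis_patrick`, `width`, `min_shared`) are hypothetical and chosen only for illustration.

```python
import math
import random
from collections import defaultdict


def hash_key(vec, funcs, width):
    """Concatenate p-stable hashes h(v) = floor((a . v + b) / width)."""
    return tuple(
        math.floor((sum(a_i * x for a_i, x in zip(a, vec)) + b) / width)
        for a, b in funcs
    )


def build_index(points, num_tables=8, num_hashes=4, width=4.0, seed=0):
    """Build several hash tables, each with its own random projections."""
    rng = random.Random(seed)
    dim = len(points[0])
    tables = []
    for _ in range(num_tables):
        funcs = [
            ([rng.gauss(0.0, 1.0) for _ in range(dim)], rng.uniform(0.0, width))
            for _ in range(num_hashes)
        ]
        buckets = defaultdict(list)
        for idx, vec in enumerate(points):
            buckets[hash_key(vec, funcs, width)].append(idx)
        tables.append((funcs, buckets))
    return tables


def knn_query(points, tables, query, k, width=4.0):
    """Gather candidates from colliding buckets, then rank them by
    Euclidean distance in the embedded space (approximate k-NN)."""
    candidates = set()
    for funcs, buckets in tables:
        candidates.update(buckets.get(hash_key(query, funcs, width), ()))
    return sorted(candidates, key=lambda i: math.dist(points[i], query))[:k]


def jarvis_patrick(neighbors, min_shared):
    """Merge i and j when each lists the other as a neighbor and their
    neighbor lists share at least min_shared entries (union-find)."""
    parent = list(range(len(neighbors)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    nsets = [set(n) for n in neighbors]
    for i, ns in enumerate(nsets):
        for j in ns:
            if j > i and i in nsets[j] and len(ns & nsets[j]) >= min_shared:
                parent[find(i)] = find(j)
    return [find(i) for i in range(len(neighbors))]


# Usage: index toy "embedded compounds", build neighbor lists, cluster.
points = [(0.0, 0.0), (0.5, 0.1), (0.2, 0.4),
          (50.0, 50.0), (50.3, 49.8), (49.9, 50.2)]
tables = build_index(points)
neighbor_lists = [knn_query(points, tables, p, k=3) for p in points]
labels = jarvis_patrick(neighbor_lists, min_shared=2)
```

Because LSH is approximate, a query can return fewer than `k` neighbors or miss a true neighbor. Using several tables raises recall at the cost of memory, which is the trade-off behind the recall rates reported in the abstract. Each neighbor list here includes the query point itself, so `min_shared` counts that self-entry as well.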
