SketchSort: Fast All Pairs Similarity Search for Large Databases of Molecular Fingerprints

Similarity networks of ligands are often reported to be useful for predicting chemical activities and target proteins. However, the naive method of computing all pairwise similarities between chemical fingerprints takes time quadratic in the number of ligands, which is prohibitive for large-scale databases containing millions of ligands. We propose a fast all-pairs similarity search method, called SketchSort, that maps chemical fingerprints to symbol strings by random projection and finds similar strings by multiple masked sorting. Because random projection is lossy, SketchSort misses a certain fraction of neighbors (i.e., false negatives). Nevertheless, the expected fraction of false negatives is derived theoretically and can be kept very small. Experiments show that SketchSort is much faster than other similarity search methods and enables us to obtain a PubChem-scale similarity network quickly.
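
The two steps named above, random projection to symbol strings followed by multiple masked sorting, can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes details the abstract does not state: symbols are the signs of random Gaussian hyperplane projections, the string is cut into equal blocks, and candidate pairs are checked against a Hamming-distance threshold. All function and parameter names (`sketch`, `masked_sort_pairs`, `miss_probability`, `n_blocks`, `max_mismatch`) are hypothetical.

```python
import itertools
import math
import random


def sketch(vectors, n_bits, rng):
    """Map each vector to a tuple of n_bits symbols (0/1), one per random hyperplane."""
    dim = len(vectors[0])
    planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]
    return [
        tuple(int(sum(p * x for p, x in zip(plane, v)) >= 0.0) for plane in planes)
        for v in vectors
    ]


def hamming(s, t):
    """Number of positions at which two equal-length strings differ."""
    return sum(a != b for a, b in zip(s, t))


def masked_sort_pairs(sketches, n_blocks, max_mismatch):
    """Return all index pairs (a, b), a < b, whose sketches differ in at most
    max_mismatch positions, using one sort per choice of unmasked blocks."""
    if not 0 <= max_mismatch < n_blocks:
        raise ValueError("need 0 <= max_mismatch < n_blocks")
    n_bits = len(sketches[0])
    bounds = [i * n_bits // n_blocks for i in range(n_blocks + 1)]
    blocks = [range(bounds[i], bounds[i + 1]) for i in range(n_blocks)]
    found = set()
    # max_mismatch differing positions touch at most max_mismatch blocks, so some
    # choice of (n_blocks - max_mismatch) blocks matches exactly (pigeonhole).
    for kept in itertools.combinations(range(n_blocks), n_blocks - max_mismatch):
        positions = [j for b in kept for j in blocks[b]]

        def key(i):
            # Masked string: only the symbols inside the kept blocks.
            return tuple(sketches[i][j] for j in positions)

        order = sorted(range(len(sketches)), key=key)
        for _, group in itertools.groupby(order, key=key):
            # Equal masked strings are adjacent after sorting; verify each pair.
            for a, b in itertools.combinations(sorted(group), 2):
                if hamming(sketches[a], sketches[b]) <= max_mismatch:
                    found.add((a, b))
    return found


def miss_probability(angle, n_bits, max_mismatch):
    """Chance that a pair at the given angle (radians) exceeds max_mismatch
    differing symbols, assuming each sign bit differs independently with
    probability angle / pi."""
    p = angle / math.pi
    return sum(
        math.comb(n_bits, k) * p**k * (1.0 - p) ** (n_bits - k)
        for k in range(max_mismatch + 1, n_bits + 1)
    )


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy binary fingerprints; the last one is a one-bit variant of the first.
    fps = [[rng.randint(0, 1) for _ in range(64)] for _ in range(200)]
    fps.append(fps[0][:63] + [1 - fps[0][63]])
    pairs = masked_sort_pairs(sketch(fps, 32, rng), n_blocks=4, max_mismatch=2)
    print(len(pairs), "candidate pairs; planted pair found:", (0, 200) in pairs)
```

Under these assumptions the masked sorting step is exact with respect to the symbol strings: it returns every pair within the Hamming threshold. Neighbors are lost only in the projection step, which is why the miss rate can be written in closed form. A fuller pipeline would also recompute the true similarity of each reported pair on the original fingerprints to discard false positives.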
