Accelerating Large-Scale Molecular Similarity Search through Exploiting High Performance Computing

Molecular similarity search is a simple but powerful chemoinformatics tool for rapidly finding molecules in a large molecular database that are structurally similar to a known reference compound. A variety of indexing structures have been developed to improve the performance of similarity search over large compound databases. However, these algorithms often incur a large computational cost to build indices and process queries, especially on large-scale molecular datasets. We study the problem of accelerating similarity search using high performance computing (HPC) and design general algorithms that speed up existing indexing algorithms. We first propose a parallel algorithm based on data chunking, which works with any indexing algorithm for similarity search. We theoretically analyze its computation cost and the relationship between the speedup and the number of data chunks. We further propose a parallel query algorithm that accelerates query processing for any graph-based indexing algorithm in an HPC environment. Both algorithms consistently offer a greater speedup than the baseline algorithms when evaluated on different datasets and parameter settings.
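The data-chunking idea can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes fingerprints are stored as sets of on-bit indices, uses Tanimoto similarity as the measure, and substitutes a brute-force scan for the per-chunk index. The names `tanimoto`, `search_chunk`, and `chunked_top_k` are hypothetical. Each chunk is searched independently for its local top-k, and the partial results are merged into a global top-k.

```python
from concurrent.futures import ThreadPoolExecutor
import heapq


def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def search_chunk(chunk, query, k):
    """Local top-k within one chunk (brute force stands in for any index)."""
    scored = ((tanimoto(fp, query), mol_id) for mol_id, fp in chunk)
    return heapq.nlargest(k, scored)


def chunked_top_k(database, query, k, num_chunks):
    """Split the database into chunks, search each one, then merge the results."""
    items = list(database.items())
    # Round-robin split; any partition of the data works for this scheme.
    chunks = [items[i::num_chunks] for i in range(num_chunks)]
    with ThreadPoolExecutor(max_workers=num_chunks) as pool:
        partial = pool.map(lambda c: search_chunk(c, query, k), chunks)
        candidates = [hit for hits in partial for hit in hits]
    # Every global top-k hit is in its own chunk's top-k, so the merge is exact.
    return heapq.nlargest(k, candidates)


# Toy database: hypothetical molecule ids mapped to fingerprint on-bits.
db = {"m1": {1, 2, 3}, "m2": {1, 2, 4}, "m3": {5, 6}, "m4": {1, 2, 3, 4}, "m5": {1}}
result = chunked_top_k(db, query={1, 2, 3}, k=2, num_chunks=3)
print(result)  # [(1.0, 'm1'), (0.75, 'm4')]
```

Threads are used here only to keep the sketch self-contained; for CPU-bound scoring on a real cluster, each chunk would more plausibly be assigned to a separate process or node.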
