Sachem: a chemical cartridge for high-performance substructure search

Background: Structure search is one of the valuable capabilities of small-molecule databases. Fingerprint-based screening methods are usually employed to enhance the search performance by reducing the number of calls to the verification procedure. In substructure search, fingerprints are designed to capture important structural aspects of the molecule to aid the decision about whether the molecule contains a given substructure. Currently available cartridges typically provide acceptable search performance for processing user queries, but do not scale satisfactorily with dataset size.

Results: We present Sachem, a new open-source chemical cartridge that implements two substructure search methods: the first is a performance-oriented reimplementation of substructure indexing based on the OrChem fingerprint, and the second is a novel method that employs newly designed fingerprints stored in inverted indices. We assessed the performance of both methods on small, medium, and large datasets containing 1, 10, and 94 million compounds, respectively. Comparison of Sachem with other freely available cartridges revealed improvements in overall performance, scaling potential, and screen-out efficiency.

Conclusions: The Sachem cartridge allows efficient substructure searches in databases of all sizes. The sublinear performance scaling of the second method and the ability to efficiently query large amounts of pre-extracted information may together open the door to new applications for substructure searches.
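The general idea of fingerprint screening over an inverted index can be illustrated with a minimal sketch. This is not Sachem's implementation: the feature names, the toy molecules, and the functions `build_inverted_index` and `screen` are all hypothetical and chosen only for illustration. The sketch assumes that every feature present in a query substructure must also be present in any molecule containing it, so molecules missing a query feature can be screened out before the expensive verification (subgraph isomorphism) step, which is not shown here.

```python
from collections import defaultdict


def build_inverted_index(fingerprints):
    """Map each feature to the set of molecule ids whose fingerprint has it."""
    index = defaultdict(set)
    for mol_id, features in fingerprints.items():
        for feature in features:
            index[feature].add(mol_id)
    return index


def screen(index, query_features, all_ids):
    """Return ids of molecules whose fingerprint contains every query feature.

    The result is only a candidate set: each candidate would still need to be
    confirmed by a real substructure verification procedure.
    """
    if not query_features:
        # An empty query fingerprint cannot exclude anything.
        return set(all_ids)
    # Intersect posting lists, shortest first, to keep intermediate sets small.
    postings = sorted((index.get(f, set()) for f in query_features), key=len)
    candidates = set(postings[0])
    for posting in postings[1:]:
        candidates &= posting
        if not candidates:
            break
    return candidates


# Toy fingerprints (hypothetical features, not a real fingerprint scheme).
fingerprints = {
    "benzene": {"ring6", "C:C"},
    "phenol": {"ring6", "C:C", "C-O"},
    "ethanol": {"C-C", "C-O"},
    "toluene": {"ring6", "C:C", "C-C"},
}

index = build_inverted_index(fingerprints)
candidates = screen(index, {"ring6", "C-O"}, fingerprints)
print(sorted(candidates))
```

In this sketch the cost of screening depends on the lengths of the posting lists for the query features rather than on the total number of molecules, which is one intuition for why an inverted index can scale better than scanning every stored fingerprint.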
