Statistical-based database fingerprint: chemical space dependent representation of compound databases

Background: Simplified representations of compound databases have several applications in cheminformatics. Herein, we introduce an alternative and general method to build single-fingerprint representations of compound databases. The approach is inspired by the previously published modal fingerprints, which aim to capture the most significant bits of a fingerprint representation for a compound data set. The novelty of the statistical-based database fingerprint (SB-DFP) proposed here is that it is generated from binomial proportion comparisons that take as reference the distribution of "1" bits in a large representative set of chemical space.

Results: To illustrate the method, SB-DFPs were constructed for 28 epigenetic target data sets retrieved from a recently published epigenomics database of interest in probe and drug discovery. For each target data set, the SB-DFPs were built from two representative fingerprints of different design, using as reference a data set of more than 15 million compounds from ZINC. The application of SB-DFP was illustrated and compared with other methods through association relationships among the 28 epigenetic data sets and through similarity searching. Overall, SB-DFPs captured both the features common to the data sets and the features distinct to each set. In similarity searching, SB-DFP equaled or outperformed other approaches for at least 20 of the 28 sets.

Conclusions: SB-DFP is a general approach, based on binomial proportion comparisons, to represent a compound data set with a single fingerprint. SB-DFP can be developed, at least in principle, from any fingerprint and any reference data set. SB-DFP is a good alternative for exploring relationships between targets through their associated compound data sets and for performing similarity searching.
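
The idea of deriving a single fingerprint from binomial proportion comparisons can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a one-sided, one-sample z-test in which the reference frequency of each bit is treated as known (reasonable only when the reference set is very large), and the function names, the `alpha` threshold, and the toy data are all hypothetical. A bit is set to "1" in the database fingerprint when its frequency in the data set is significantly higher than in the reference set. A Tanimoto function is included to show how such a fingerprint could be used as a query in similarity searching.

```python
import math


def one_sided_p_value(k_db, n_db, p_ref):
    """P-value for H1: the bit's proportion in the data set exceeds p_ref.

    Assumption: normal approximation to the binomial, with the reference
    proportion p_ref treated as a fixed, known value.
    """
    if p_ref <= 0.0:
        # Bit never set in the reference: any occurrence counts as enriched.
        return 0.0 if k_db > 0 else 1.0
    if p_ref >= 1.0:
        # Bit always set in the reference: it cannot be enriched.
        return 1.0
    p_hat = k_db / n_db
    std_err = math.sqrt(p_ref * (1.0 - p_ref) / n_db)
    z = (p_hat - p_ref) / std_err
    # Upper-tail probability of the standard normal distribution.
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def sb_dfp(db_fps, ref_freqs, alpha=0.01):
    """Build a single fingerprint from a list of equal-length bit lists.

    ref_freqs[i] is the fraction of reference compounds with bit i set;
    alpha is a hypothetical significance threshold.
    """
    n_db = len(db_fps)
    counts = [sum(column) for column in zip(*db_fps)]
    return [
        1 if one_sided_p_value(k, n_db, p) < alpha else 0
        for k, p in zip(counts, ref_freqs)
    ]


def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient between two bit lists of equal length."""
    both = sum(a & b for a, b in zip(fp_a, fp_b))
    either = sum(a | b for a, b in zip(fp_a, fp_b))
    return both / either if either else 0.0


# Toy data: four compounds with four bits each, plus made-up reference
# frequencies for the same four bits.
toy_db = [[1, 1, 0, 0], [1, 0, 0, 0], [1, 1, 0, 1], [1, 1, 0, 0]]
toy_ref_freqs = [0.1, 0.7, 0.5, 0.2]
toy_fingerprint = sb_dfp(toy_db, toy_ref_freqs)
toy_similarity = tanimoto(toy_fingerprint, toy_db[0])
```

In this toy case only the first bit is strongly over-represented relative to the reference, so it is the only bit that survives the test. With real data the normal approximation would need enough compounds per data set to be trustworthy, and a two-sample test or a multiple-testing correction may be more appropriate.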
