Developing Collaborative QSAR Models Without Sharing Structures

It is widely understood that QSAR models improve considerably when more training data are used. However, irrespective of model quality, predictive performance degrades quickly once chemical structures diverge too far from the training set. To extend the applicability domain, the diversity of the training set must be increased, which can be achieved by combining data from diverse sources. Public data are easy to include; proprietary data are more difficult to add because of intellectual property concerns. In this contribution, we present a method for the collaborative development of linear regression models that addresses this problem. The method differs from previous approaches in that data are shared only in aggregated form. This prevents access to individual data points and therefore avoids the disclosure of confidential structural information. The final models are equivalent to models built on the combined data sets.
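
The abstract does not say which aggregates are exchanged. As an illustration only, the sketch below assumes each partner shares the sums X^T X and X^T y computed from its own descriptor matrix X and activity vector y, which is one standard way to obtain a least-squares fit identical to the fit on pooled data. The helper names `local_aggregates` and `fit_from_aggregates`, the synthetic data, and the omission of an intercept term are assumptions made for this example and are not taken from the paper.

```python
import numpy as np


def local_aggregates(X, y):
    """Hypothetical helper: statistics one partner computes locally and shares.

    Only the feature-by-feature matrix X^T X and the vector X^T y leave the
    partner; the individual rows of X are not transmitted.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    return X.T @ X, X.T @ y


def fit_from_aggregates(aggregates):
    """Hypothetical helper: solve the normal equations from summed aggregates."""
    xtx = sum(a[0] for a in aggregates)
    xty = sum(a[1] for a in aggregates)
    return np.linalg.solve(xtx, xty)


# Synthetic stand-ins for two partners' descriptor matrices and activities.
rng = np.random.default_rng(0)
true_w = np.array([1.5, -2.0, 0.5])
X_a = rng.normal(size=(40, 3))
y_a = X_a @ true_w + rng.normal(scale=0.1, size=40)
X_b = rng.normal(size=(60, 3))
y_b = X_b @ true_w + rng.normal(scale=0.1, size=60)

# Model built from shared aggregates only.
w_shared = fit_from_aggregates(
    [local_aggregates(X_a, y_a), local_aggregates(X_b, y_b)]
)

# Reference model built on the pooled raw data.
w_pooled, *_ = np.linalg.lstsq(
    np.vstack([X_a, X_b]), np.concatenate([y_a, y_b]), rcond=None
)

print(np.allclose(w_shared, w_pooled))
```

Because the pooled X^T X and X^T y are the sums of the per-partner terms, the two coefficient vectors agree up to floating-point error. This illustrates the equivalence stated in the abstract under the assumptions above.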
