A novel applicability domain technique for mapping predictive reliability across the chemical space of a QSAR: reliability-density neighbourhood

Defining the regions of chemical space where a predictive model can be safely used is a necessary condition for assuring the reliability of new predictions. Reliability must therefore be determined across chemical space in order to localize “safe” and “unsafe” regions for prediction. To this end, we devised an applicability domain technique that addresses the data locally instead of handling it as a whole: the reliability-density neighbourhood (RDN). The main novelty of this method is that it characterizes each individual training instance according to the density of its neighbourhood in the training set, as well as its individual bias and precision. By scanning through chemical space (iteratively increasing the applicability domain area), we observed that new test compounds are successively included in the applicability domain in an order that strongly correlates with their predictive performance. This allows local reliability to be mapped across different locations in the training set space, and thus allows regions where the model has low reliability to be identified. The method also showed matching profiles between two external sets, which indicates that it performs robustly with new data. A second novel aspect of this technique is that it is paired with a specific feature selection algorithm. We studied the impact of the feature set used and found that the top 20 features selected by ReliefF yielded the best results, as opposed to using the model’s features or the entire feature set, as is commonly done. As a third novel aspect, we propose a new scoring function to help evaluate the quality of an applicability domain profile (i.e., the curve of accuracy vs the applicability domain measure in question). Overall, the RDN proved to be a promising method that can correctly sort new instances according to predictive performance. As a result, this technique can be received by an end user as proof of concept for the performance of a QSAR model on new data, thus promoting the user’s trust in the QSAR output.
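
The scanning idea described above can be illustrated with a minimal Python sketch. It is not the published RDN implementation: the function names, the use of Euclidean distance, the choice of the mean distance to the k nearest training neighbours as the local radius, and the `reliability` weight (a hypothetical per-instance factor in [0, 1] standing in for the bias and precision terms) are all assumptions made for illustration. Each training instance gets its own radius, the radii grow as k increases, and each test compound is labelled with the first k at which it falls inside any training neighbourhood.

```python
import numpy as np

def local_radii(X_train, k):
    """Mean Euclidean distance from each training point to its k nearest training neighbours."""
    d = np.linalg.norm(X_train[:, None, :] - X_train[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)          # a point is not its own neighbour
    nearest = np.sort(d, axis=1)[:, :k]  # k smallest distances per row
    return nearest.mean(axis=1)

def coverage_level(X_train, X_test, reliability, k_values):
    """Return, per test point, the smallest k at which it is covered (-1 if never covered).

    `reliability` is an assumed per-training-instance weight in [0, 1] that
    shrinks the neighbourhood of instances the model predicts poorly.
    """
    dist = np.linalg.norm(X_test[:, None, :] - X_train[None, :, :], axis=-1)
    first_k = np.full(len(X_test), -1)
    for k in sorted(k_values):
        radii = local_radii(X_train, k) * reliability
        inside = (dist <= radii[None, :]).any(axis=1)
        newly_covered = inside & (first_k == -1)
        first_k[newly_covered] = k
    return first_k

# Toy data: four training points on a unit square and three test points.
X_train = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
X_test = np.array([[0.5, 0.5], [2.0, 0.5], [3.0, 3.0]])
levels = coverage_level(X_train, X_test, np.ones(len(X_train)), [1, 2, 3])
print(levels.tolist())  # → [1, 3, -1]
```

Plotting model accuracy for the test compounds grouped by their coverage level would give a profile of the kind the abstract refers to (accuracy vs the applicability domain measure); a well-behaved profile is expected to show accuracy decreasing as the level increases.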
