Variable Selection and Interpretation in Structure-Affinity Correlation Modeling of Estrogen Receptor Binders

A computational approach for identifying and investigating correlations between chemical structure and a selected biological property is described. It is based on a set of 132 compounds of known chemical structure that were tested for their binding affinities to the estrogen receptor. Three multivariate modeling methods were applied: partial least-squares regression, counterpropagation neural networks, and error-back-propagation neural networks. The prediction ability of each model was tested so that the models could be compared. To reduce the extensive set of calculated structural descriptors, two types of variable selection methods were applied, depending on the modeling approach used. In particular, the final partial least-squares regression model was built using the "variable importance in projection" variable selection method, while genetic algorithms were applied in neural network modeling to select the optimal set of descriptors. A thorough statistical study of the variables selected by genetic algorithms is presented. The results were assessed with the aim of gaining insight into the mechanisms involved in the binding of estrogenic compounds to the receptor. The genetic-algorithm-based variable selection was controlled with a test set of compounds extracted from the available data set. To compare the predictive ability of all the optimized models, a leave-one-out cross-validation procedure was applied. The best model was the nonlinear neural network model based on the error-back-propagation algorithm, which gave R2 = 92.2% and Q2 = 70.8%.
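The two statistics quoted above can be illustrated with a minimal sketch. It assumes the common definitions R2 = 1 - RSS/SS_tot for the fitted model and Q2 = 1 - PRESS/SS_tot for leave-one-out predictions; the abstract does not state the exact formulas used. An ordinary least-squares model stands in for the neural network, the data are synthetic, and all function names are hypothetical.

```python
import numpy as np

def fit_ols(X, y):
    # Least-squares fit with an intercept column (stand-in model, not the paper's network).
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_ols(coef, X):
    A = np.column_stack([np.ones(len(X)), X])
    return A @ coef

def r2_and_q2(X, y):
    # Total sum of squares around the mean of all responses.
    ss_tot = np.sum((y - y.mean()) ** 2)

    # R2: goodness of fit on the same data used for training.
    fitted = predict_ols(fit_ols(X, y), X)
    r2 = 1.0 - np.sum((y - fitted) ** 2) / ss_tot

    # Q2: leave-one-out cross-validation. Each compound is predicted
    # by a model trained on all the other compounds.
    press = 0.0
    for i in range(len(y)):
        keep = np.arange(len(y)) != i
        coef = fit_ols(X[keep], y[keep])
        pred = predict_ols(coef, X[i:i + 1])[0]
        press += (y[i] - pred) ** 2
    q2 = 1.0 - press / ss_tot
    return r2, q2

# Synthetic descriptors and responses, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
y = X @ np.array([1.0, -0.5, 0.25]) + rng.normal(scale=0.3, size=40)

r2, q2 = r2_and_q2(X, y)
print(f"R2 = {r2:.3f}, Q2 = {q2:.3f}")
```

Because each left-out compound is predicted by a model that never saw it, Q2 is normally lower than R2, and a wide gap between the two suggests that the model fits the training data better than it generalizes.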