Variable Selection in QSAR Studies. II. A Highly Efficient Combination of Systematic Search and Evolution

Recently, two evolutionary strategies for the derivation of regression models, a genetic function approximation and the mutation/selection algorithm MUSEUM, have been described. The MUSEUM (Mutation and Selection Uncover Models) algorithm starts from a model containing randomly chosen variables. Random mutation, first by addition or elimination of only one or very few variables, afterwards by simultaneous random additions, eliminations and/or exchanges of several variables at a time, leads to new models, which are evaluated by an appropriate fitness function. Only the “fittest” model is stored and used for further mutation and selection, leading to progressively better models. However, the fitness of all models with up to three X variables can be determined much faster by calculating the multiple correlation coefficients ry.ij and ry.ijk from the partial correlation coefficients ryi, rij, ryj.i, rjk.i and ryk.ij. Using the Selwood data set (n = 31 compounds, k = 53 variables), it is demonstrated that systematic search is the best strategy for regression models with two or three X variables. The variables contained in the best three-variable models can then be selected for further investigation using the evolutionary approach. With the exception of complex models containing six or more variables, nearly all relevant regression models are found by this combination of systematic search with the mutation/selection algorithm MUSEUM; the results are obtained in considerably less time than when all variables are included in the calculations. In addition, systematic search is a valuable tool for variable selection prior to stepwise regression and PLS analyses.
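
The fast evaluation of two- and three-variable models can be illustrated with a minimal sketch. It assumes the standard textbook recursion for partial correlations and the identity 1 − r²y.ijk = (1 − r²yi)(1 − r²yj.i)(1 − r²yk.ij); it is not taken from the MUSEUM program. The function names (`partial_r`, `r2_two`, `r2_three`, `best_models`) are hypothetical, and the data are synthetic, not the Selwood data set.

```python
import itertools
import numpy as np

def partial_r(r_ab, r_ac, r_bc):
    """Partial correlation r_ab.c from the three pairwise correlations."""
    return (r_ab - r_ac * r_bc) / np.sqrt((1.0 - r_ac**2) * (1.0 - r_bc**2))

def r2_two(R, y, i, j):
    """Squared multiple correlation r^2_y.ij, read off the correlation matrix R."""
    r_yj_i = partial_r(R[y, j], R[y, i], R[i, j])
    return 1.0 - (1.0 - R[y, i]**2) * (1.0 - r_yj_i**2)

def r2_three(R, y, i, j, k):
    """Squared multiple correlation r^2_y.ijk, read off the correlation matrix R."""
    r_yj_i = partial_r(R[y, j], R[y, i], R[i, j])
    r_yk_i = partial_r(R[y, k], R[y, i], R[i, k])
    r_jk_i = partial_r(R[j, k], R[i, j], R[i, k])
    # Second-order partial r_yk.ij, built from the first-order partials.
    r_yk_ij = partial_r(r_yk_i, r_yj_i, r_jk_i)
    return 1.0 - (1.0 - R[y, i]**2) * (1.0 - r_yj_i**2) * (1.0 - r_yk_ij**2)

def best_models(X, y, size, top=5):
    """Systematic search over all models with `size` (2 or 3) X variables."""
    if size not in (2, 3):
        raise ValueError("this sketch only covers two- and three-variable models")
    # The correlation matrix is computed once; no regression is fitted per model.
    R = np.corrcoef(np.column_stack([X, y]), rowvar=False)
    y_idx = X.shape[1]
    score = r2_two if size == 2 else r2_three
    results = [(float(score(R, y_idx, *combo)), combo)
               for combo in itertools.combinations(range(y_idx), size)]
    results.sort(reverse=True)
    return results[:top]

# Synthetic illustration (assumed data): 31 objects, 10 candidate variables,
# of which columns 1, 4 and 7 actually determine y.
rng = np.random.default_rng(0)
X = rng.normal(size=(31, 10))
y = 2.0 * X[:, 1] - 1.5 * X[:, 4] + 1.0 * X[:, 7] + rng.normal(scale=0.1, size=31)

top3 = best_models(X, y, size=3)
for r2, combo in top3:
    print(combo, round(r2, 4))
```

Because every model is scored from entries of a single correlation matrix, the cost per model is a handful of arithmetic operations, which is what makes an exhaustive search over all two- and three-variable combinations feasible. The variables appearing in the top-ranked models could then be passed on as the reduced variable pool for an evolutionary search.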