GA Strategy for Variable Selection in QSAR Studies: Enhancement of Comparative Molecular Binding Energy Analysis by GA‐Based PLS Method

Comparative molecular binding energy (COMBINE) analysis is a novel approach for estimating binding affinity in structure-based drug design (SBDD). COMBINE involves an extensive partitioning of the binding interaction energy, followed by multivariate regression analysis to derive a model. In COMBINE, partial least squares (PLS) is commonly used as the statistical method. Although PLS is robust and stable, it has been shown that its predictive performance degrades as the number of variables increases. Also, from a practical point of view, a model that uses many variables becomes complicated and difficult to interpret. PLS coupled with variable selection is therefore expected to produce a more predictive and more interpretable COMBINE model. The purpose of this paper is to examine whether genetic algorithm-based PLS (GAPLS), developed by our group for variable selection, can enhance the prediction and interpretation of the COMBINE model. The structure-activity data of human immunodeficiency virus type 1 (HIV-1) protease inhibitors were used as a test example. By applying GAPLS to this data set, several improved PLS models with a high cross-validated r² value and a small number of variables were obtained. To select the best model among them, external validation was performed for each model. The model finally selected was further examined by comparing it with the 3D structure of HIV-1 protease using computer graphics, and their agreement was confirmed.
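The general idea of coupling a genetic algorithm with PLS can be illustrated with a minimal sketch. It is not the authors' GAPLS implementation: the fitness (leave-one-out cross-validated r², computed as 1 − PRESS/SS), the tournament selection, uniform crossover, bit-flip mutation, elitism, and all names and parameter values (`pls1_predict`, `q2_loo`, `ga_select`, `n_comp`, `pop_size`, `n_gen`, `p_mut`) are assumptions chosen for illustration. The data are synthetic, not the HIV-1 protease set.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def pls1_predict(X, y, x_new, n_comp):
    """Fit single-response PLS (NIPALS) on (X, y) and predict y for x_new."""
    n, m = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(m)]
    y_mean = sum(y) / n
    E = [[row[j] - x_mean[j] for j in range(m)] for row in X]  # centered X
    f = [v - y_mean for v in y]                                # centered y
    e_new = [x_new[j] - x_mean[j] for j in range(m)]
    y_hat = y_mean
    for _ in range(min(n_comp, m)):
        w = [dot([E[i][j] for i in range(n)], f) for j in range(m)]
        norm = dot(w, w) ** 0.5
        if norm < 1e-12:          # nothing left to explain
            break
        w = [v / norm for v in w]
        t = [dot(row, w) for row in E]             # scores
        tt = dot(t, t)
        if tt < 1e-12:
            break
        load = [dot([E[i][j] for i in range(n)], t) / tt for j in range(m)]
        q = dot(f, t) / tt
        t_new = dot(e_new, w)
        y_hat += q * t_new
        # Deflate training data and the new sample by this component.
        E = [[E[i][j] - t[i] * load[j] for j in range(m)] for i in range(n)]
        f = [f[i] - q * t[i] for i in range(n)]
        e_new = [e_new[j] - t_new * load[j] for j in range(m)]
    return y_hat

def q2_loo(X, y, mask, n_comp):
    """Leave-one-out cross-validated r^2 using only the masked variables."""
    cols = [j for j, bit in enumerate(mask) if bit]
    if not cols:
        return float("-inf")      # an empty model is never acceptable
    Xs = [[row[j] for j in cols] for row in X]
    y_mean = sum(y) / len(y)
    press = 0.0
    for i in range(len(y)):
        pred = pls1_predict(Xs[:i] + Xs[i + 1:], y[:i] + y[i + 1:], Xs[i], n_comp)
        press += (y[i] - pred) ** 2
    ss = sum((v - y_mean) ** 2 for v in y)
    return 1.0 - press / ss

def ga_select(X, y, n_comp=2, pop_size=12, n_gen=10, p_mut=0.05, seed=0):
    """Search binary variable masks for a high cross-validated r^2."""
    rng = random.Random(seed)
    m = len(X[0])
    cache = {}

    def fitness(mask):
        key = tuple(mask)
        if key not in cache:
            cache[key] = q2_loo(X, y, key, n_comp)
        return cache[key]

    pop = [[rng.randint(0, 1) for _ in range(m)] for _ in range(pop_size)]
    history = []
    for _ in range(n_gen):
        pop.sort(key=fitness, reverse=True)
        history.append(fitness(pop[0]))
        nxt = [pop[0][:]]                          # elitism: keep the best mask
        while len(nxt) < pop_size:
            a = max(rng.sample(pop, 2), key=fitness)   # tournament selection
            b = max(rng.sample(pop, 2), key=fitness)
            child = [ga if rng.random() < 0.5 else gb for ga, gb in zip(a, b)]
            child = [1 - g if rng.random() < p_mut else g for g in child]
            nxt.append(child)
        pop = nxt
    best = max(pop, key=fitness)
    history.append(fitness(best))
    return best, fitness(best), history

# Synthetic demonstration: y depends on variables 0 and 3 only.
_rng = random.Random(1)
X_demo = [[_rng.gauss(0, 1) for _ in range(6)] for _ in range(16)]
y_demo = [2.0 * r[0] - 1.5 * r[3] + _rng.gauss(0, 0.1) for r in X_demo]
best_mask, best_q2, q2_history = ga_select(X_demo, y_demo)
print(best_mask, round(best_q2, 3))
```

A high cross-validated r² from such a search is itself subject to selection bias, because many masks are tried against the same data; this is the reason the candidate models are subsequently checked by external validation.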