A Combinatorial Approach to the Variable Selection in Multiple Linear Regression: Analysis of Selwood et al. Data Set – A Case Study

A combinatorial protocol (CP) is introduced and interfaced with multiple linear regression (MLR) for variable selection. The efficiency of CP-MLR rests primarily on restricting the entry of correlated variables at the model development stage. It has been used to analyze the Selwood et al. data set [16], and the resulting models are compared with those reported from the GFA [8] and MUSEUM [9] approaches. For this data set, CP-MLR identified three highly independent models (27, 28 and 31) with Q² values ranging from 0.632 to 0.518. These models are also divergent and unique. Although the present study shares no models with the GFA [8] and MUSEUM [9] results, several descriptors are common to all of these studies, including the present one. A simulation is also carried out on the same data set to explain model formation in CP-MLR. The results suggest that the proposed method should be able to handle data sets with 50 to 60 descriptors in a reasonable time frame. By carefully selecting the interparameter correlation cutoff values in CP-MLR, one can identify divergent models and handle data sets larger than the present one without excessive computer time.
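The central idea of the abstract, keeping correlated descriptors out of the same candidate model before any regression is fitted, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names (`loo_q2`, `cp_mlr_sketch`), the default cutoffs (`r_cutoff=0.3`, `q2_min=0.5`), the exhaustive enumeration of subsets, and the use of leave-one-out cross-validation for Q² are all assumptions made for illustration only.

```python
from itertools import combinations

import numpy as np


def loo_q2(X, y):
    """Leave-one-out cross-validated Q^2 of an OLS model with intercept."""
    n = len(y)
    A = np.column_stack([np.ones(n), X])  # design matrix with intercept column
    press = 0.0
    for i in range(n):
        keep = np.arange(n) != i  # drop observation i, refit, predict it
        coef, *_ = np.linalg.lstsq(A[keep], y[keep], rcond=None)
        press += (y[i] - A[i] @ coef) ** 2
    return 1.0 - press / np.sum((y - y.mean()) ** 2)


def cp_mlr_sketch(X, y, k, r_cutoff=0.3, q2_min=0.5):
    """Enumerate k-descriptor subsets, skipping any subset in which two
    descriptors correlate above r_cutoff (hypothetical cutoff values)."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    models = []
    for subset in combinations(range(X.shape[1]), k):
        # Assumed filter: reject the subset before fitting if any pair of
        # its descriptors exceeds the interparameter correlation cutoff.
        if any(corr[i, j] > r_cutoff for i, j in combinations(subset, 2)):
            continue
        q2 = loo_q2(X[:, list(subset)], y)
        if q2 >= q2_min:
            models.append((subset, q2))
    # Best cross-validated models first.
    return sorted(models, key=lambda m: -m[1])
```

Because the correlation check runs before the regression, a tighter `r_cutoff` reduces the number of models that must be fitted, which is one plausible reading of why the choice of cutoff governs both the divergence of the resulting models and the computing time.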

[1]  Hugo Kubinyi. Variable Selection in QSAR Studies. I. An Evolutionary Algorithm, 1994.

[2]  S. Wold, et al. Multivariate data analysis and experimental design in biomedical research, 1988, Progress in medicinal chemistry.

[3]  James H. Wikel, et al. The use of neural networks for variable selection in QSAR, 1993.

[4]  Roy E. Welsch, et al. Efficient Computing of Regression Diagnostics, 1981.

[5]  P. Mátyus, et al. Application of neural networks in structure–activity relationships, 1999, Medicinal research reviews.

[6]  D. Manallack, et al. Analysis of linear and nonlinear QSAR data using neural networks, 1994, Journal of medicinal chemistry.

[7]  A. C. Rencher, et al. Inflation of R² in Best Subset Regression, 1980.

[8]  J. O. Rawlings, et al. Applied Regression Analysis: A Research Tool, 1988.

[9]  Valerie J. Gillet, et al. Multiobjective optimization in quantitative structure-activity relationships: deriving accurate and interpretable QSARs, 2002, Journal of medicinal chemistry.

[10]  D. Livingstone, et al. Structure-activity relationships of antifilarial antimycin analogues: a multivariate pattern recognition study, 1990, Journal of medicinal chemistry.

[11]  Stefan H. Unger, et al. Model building in structure-activity relations. Reexamination of adrenergic blocking activity of β-halo-β-arylalkylamines, 1973.

[12]  Sung Jin Cho, et al. Genetic Algorithm Guided Selection: Variable Selection and Subset Selection, 2002, J. Chem. Inf. Comput. Sci.

[13]  M. Karplus, et al. Evolutionary optimization in quantitative structure-activity relationship: an application of genetic neural networks, 1996, Journal of medicinal chemistry.

[14]  Anton J. Hopfinger, et al. Application of Genetic Function Approximation to Quantitative Structure-Activity Relationships and Quantitative Structure-Property Relationships, 1994, J. Chem. Inf. Comput. Sci.

[15]  Chris L. Waller, et al. Development and Validation of a Novel Variable Selection Technique with Application to Multidimensional Quantitative Structure-Activity Relationship Studies, 1999, J. Chem. Inf. Comput. Sci.

[16]  H. Kubinyi. Variable Selection in QSAR Studies. II. A Highly Efficient Combination of Systematic Search and Evolution, 1994.

[17]  James W. McFarland, et al. On Identifying Likely Determinants of Biological Activity in High Dimensional QSAR Problems, 1994.

[18]  J. Zupan, et al. Neural Networks in Chemistry, 1993.