Is your QSAR/QSPR descriptor real or trash?

The sign change problem in quantitative structure–activity relationship (QSAR), quantitative structure–property relationship (QSPR) and related studies is the controversy over the signs of a descriptor's correlation coefficients and regression coefficients in univariate and multivariate regressions, before and after the data split. Among 50 investigated regression models with 227 descriptors extracted from the literature, the sign change problem was shown to have a very high frequency, according to four new criteria proposed in this work for its assessment. The sign change problem can be substantially reduced, and even eliminated, for a given dataset by statistically based variable selection and by checking for the sign change problem before model validation and interpretation. Knowing the statistical fundamentals behind the sign change problem, and identifying and understanding it, helps in finding effective remedies for regression models with this deficiency. Copyright © 2010 John Wiley & Sons, Ltd.
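
One manifestation of the problem described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the four criteria proposed in the paper: it compares the sign of each descriptor's Pearson correlation with the response against the sign of its coefficient in an ordinary least-squares multiple regression. The function name `sign_check`, the synthetic data and all numeric settings are assumptions made for this example.

```python
import numpy as np


def sign_check(X, y):
    """Compare univariate and multivariate signs for each descriptor.

    Returns the Pearson correlation of each column of X with y, the
    corresponding multiple-regression coefficients (intercept excluded),
    and a boolean mask marking descriptors whose two signs disagree.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)

    # Univariate view: correlation of each descriptor with the response.
    r = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])

    # Multivariate view: ordinary least squares with an intercept column.
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    b = coef[1:]

    flipped = np.sign(r) != np.sign(b)
    return r, b, flipped


# Synthetic data (hypothetical): two strongly collinear descriptors.
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.3 * rng.normal(size=n)               # x2 nearly duplicates x1
y = -1.0 * x1 + 2.0 * x2 + 0.1 * rng.normal(size=n)

X = np.column_stack([x1, x2])
r, b, flipped = sign_check(X, y)

for name, rj, bj, fj in zip(["x1", "x2"], r, b, flipped):
    print(f"{name}: r = {rj:+.2f}, b = {bj:+.2f}, sign change = {fj}")
```

In this construction the first descriptor correlates positively with the response on its own but receives a negative coefficient once the collinear second descriptor enters the model. Running such a check before validating or interpreting a model is in the spirit of the abstract's recommendation, though the paper's own criteria may differ in detail.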
