Impact of correlation on predictive ability of biomarkers

In this paper, we investigate how the correlation structure of the independent variables affects the discrimination of a risk prediction model. Using multivariate normal data and a binary outcome, we prove that zero correlation among predictors is often detrimental to discrimination in a risk prediction model, and that negatively correlated predictors with positive effect sizes are beneficial. A very high multiple R-squared from regressing the new predictor on the old ones can also be beneficial. As a practical guide to new variable selection, we recommend selecting predictors that are negatively correlated with the risk score based on the existing variables. This step is easy to implement even when the number of new predictors is large. We illustrate our results using real-life Framingham data, which suggest that the conclusions hold outside of normality. The findings presented in this paper might be useful for the preliminary selection of potentially important predictors, especially in situations where the number of predictors is large.
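The effect of correlation on discrimination can be sketched numerically. The example below is an illustration rather than the paper's own derivation: it assumes two unit-variance predictors that are bivariate normal with a common covariance matrix in events and non-events, and it uses the standard result that, under those assumptions, the AUC of the optimal linear combination is Φ(√(Δ²/2)), where Δ² is the squared Mahalanobis distance between the group means. The function names `normal_cdf` and `auc_two_predictors` and the effect sizes are hypothetical choices made for this sketch.

```python
import math


def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def auc_two_predictors(d1, d2, rho):
    """AUC of the optimal linear combination of two predictors.

    Assumes (for illustration) bivariate normal predictors with unit
    variances, correlation rho in both groups, and standardized mean
    differences d1 and d2 between events and non-events.
    """
    if not -1.0 < rho < 1.0:
        raise ValueError("rho must lie strictly between -1 and 1")
    # Squared Mahalanobis distance for a 2x2 correlation matrix.
    mahalanobis_sq = (d1 ** 2 - 2.0 * rho * d1 * d2 + d2 ** 2) / (1.0 - rho ** 2)
    return normal_cdf(math.sqrt(mahalanobis_sq / 2.0))


if __name__ == "__main__":
    # Illustrative effect sizes: both positive, the second one weaker.
    for rho in (-0.9, -0.5, 0.0, 0.5, 0.9):
        auc = auc_two_predictors(0.5, 0.1, rho)
        print(f"rho = {rho:+.1f}  AUC = {auc:.3f}")
```

With both effect sizes positive, negative correlation gives a higher AUC than zero correlation in this setting, and a very high positive correlation paired with a weak second predictor can also exceed the zero-correlation case, which is in line with the abstract's statements.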
