Determining relative importance of variables in developing and validating predictive models

Background: Multiple regression models are used in a wide range of scientific disciplines, and automated model selection procedures are frequently used to identify independent predictors. However, determining the relative importance of potential predictors and validating the fitted models for stability, predictive accuracy and generalizability are often overlooked or not done thoroughly.

Methods: Using a case study aimed at predicting which children with acute lymphoblastic leukemia (ALL) are at low risk of Tumor Lysis Syndrome (TLS), we propose and compare two strategies, bootstrapping and random split of the data, for ordering potential predictors according to their relative importance with respect to model stability and generalizability. We also propose an approach for developing models, based on the relative increase in the percentage of explained variation and in the area under the Receiver Operating Characteristic (ROC) curve, in which variables from our ordered list enter the model according to their importance. An additional data set aimed at identifying predictors of prostate cancer penetration is also used for illustrative purposes.

Results: Age is identified as the most important predictor of TLS. It is selected 100% of the time using the bootstrapping approach. Using the random split method, it is selected 99% of the time in the training data and is significant (at the 5% level) 98% of the time in the validation data set. This indicates that age is a stable predictor of TLS with good generalizability. The second most important variable is white blood cell count (WBC). Our methods also identified an important predictor of TLS that would have been omitted had we relied on any one of the automated model selection procedures alone. A group at low risk of TLS consists of children younger than 10 years of age, without T-cell immunophenotype, whose baseline WBC is < 20 × 10^9/L and whose palpable spleen is < 2 cm. For the prostate cancer data set, the Gleason score and digital rectal exam are identified as the most important indicators of whether the tumor has penetrated the prostate capsule.

Conclusion: Our model selection procedures, based on bootstrap re-sampling and repeated random split techniques, can be used to assess the strength of evidence that a variable is truly an independent and reproducible predictor. Our methods can therefore be used to develop stable and reproducible models with good performance, and can also serve as a tool for validating a predictive model. Previous biological and clinical studies support the findings based on our selection and validation strategies. However, extensive simulations may be required to assess the performance of our methods under different scenarios and to check their sensitivity to random fluctuation in the data.
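The two ordering strategies described in the Methods can be illustrated with a minimal sketch. This is not the authors' implementation: the selection rule is a deliberately simple stand-in (a two-sample z-test per predictor at roughly the 5% level) for an automated model selection procedure, the data are simulated, and all function and variable names (`select_predictors`, `bootstrap_inclusion_frequency`, `random_split_frequency`, `signal`, `noise`) are hypothetical. The validation frequency is counted here over all splits, which is an assumption about how such percentages might be tallied.

```python
import random
from math import exp, sqrt
from statistics import mean, variance


def select_predictors(rows, outcome, names, z_crit=1.96):
    """Stand-in selection rule (an assumption, not the paper's procedure):
    keep predictors whose mean differs between outcome groups at roughly
    the 5% level, using a two-sample z-statistic with unequal variances."""
    selected = set()
    for j, name in enumerate(names):
        cases = [r[j] for r, y in zip(rows, outcome) if y == 1]
        controls = [r[j] for r, y in zip(rows, outcome) if y == 0]
        if len(cases) < 2 or len(controls) < 2:
            continue  # cannot estimate a variance in this resample
        se = sqrt(variance(cases) / len(cases) + variance(controls) / len(controls))
        if se > 0 and abs(mean(cases) - mean(controls)) / se > z_crit:
            selected.add(name)
    return selected


def bootstrap_inclusion_frequency(rows, outcome, names, n_boot=200, seed=0):
    """Resample subjects with replacement, rerun the selection rule on each
    resample, and rank predictors by how often they are selected."""
    rng = random.Random(seed)
    n = len(rows)
    counts = {name: 0 for name in names}
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        chosen = select_predictors([rows[i] for i in idx],
                                   [outcome[i] for i in idx], names)
        for name in chosen:
            counts[name] += 1
    ranking = [(name, counts[name] / n_boot) for name in names]
    return sorted(ranking, key=lambda item: -item[1])


def random_split_frequency(rows, outcome, names, n_splits=200,
                           train_frac=0.5, seed=0):
    """Repeatedly split the data at random. For each predictor, return
    (fraction of splits selected in training,
     fraction of splits selected in training AND significant in validation)."""
    rng = random.Random(seed)
    n = len(rows)
    n_train = int(train_frac * n)
    train_hits = {name: 0 for name in names}
    valid_hits = {name: 0 for name in names}
    for _ in range(n_splits):
        idx = list(range(n))
        rng.shuffle(idx)
        train, valid = idx[:n_train], idx[n_train:]
        in_train = select_predictors([rows[i] for i in train],
                                     [outcome[i] for i in train], names)
        in_valid = select_predictors([rows[i] for i in valid],
                                     [outcome[i] for i in valid], names)
        for name in in_train:
            train_hits[name] += 1
            if name in in_valid:
                valid_hits[name] += 1
    return {name: (train_hits[name] / n_splits, valid_hits[name] / n_splits)
            for name in names}


def simulate(n=300, seed=1):
    """Simulated data: 'signal' drives the binary outcome, 'noise' does not."""
    rng = random.Random(seed)
    rows, outcome = [], []
    for _ in range(n):
        signal, noise = rng.gauss(0, 1), rng.gauss(0, 1)
        p = 1 / (1 + exp(-1.5 * signal))
        outcome.append(1 if rng.random() < p else 0)
        rows.append((signal, noise))
    return rows, outcome


if __name__ == "__main__":
    rows, outcome = simulate()
    names = ["signal", "noise"]
    print(bootstrap_inclusion_frequency(rows, outcome, names))
    print(random_split_frequency(rows, outcome, names))
```

A predictor that is selected in nearly every bootstrap resample, and that remains significant in the validation half of nearly every random split, would be ranked as stable and generalizable in the sense used above. In practice the stand-in rule would be replaced by the model selection procedure actually under study.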
