Mortality risk score prediction in an elderly population using machine learning.

Standard practice for prediction often relies on parametric regression methods. Newer methods from the machine learning literature, such as random forests and neural networks, have been introduced in epidemiologic studies. However, an investigator will not know a priori which algorithm to select and may wish to try several. Here I apply the super learner, an ensemble machine learning approach that combines multiple algorithms into a single algorithm and returns a prediction function with the best cross-validated mean squared error. Super learning is a generalization of stacking methods. I used super learning in the Study of Physical Performance and Age-Related Changes in Sonomans (SPPARCS) to predict death among 2,066 residents of Sonoma, California, aged 54 years or more during the period 1993-1999. The super learner for predicting death (risk score) improved upon every single algorithm in the collection, although its performance was similar to that of several of them. The super learner outperformed the worst algorithm (neural networks) by 44% with respect to estimated cross-validated mean squared error and had an R² value of 0.201. The improvement of the super learner over random forest with respect to R² was approximately 2-fold. The super learner is an alternative for risk score prediction that can provide improved performance.
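
The idea of combining candidate algorithms by cross-validated mean squared error can be illustrated with a minimal sketch. This is not the analysis code used in the study: the three candidate learners, the simulated binary outcome, the fold assignment, and the grid search over convex weights are all assumptions chosen to keep the example dependency-free.

```python
import math
import random
from itertools import product

# Hypothetical candidate learners (stand-ins, not the study's library).
# Each takes training data and returns a prediction function.
def fit_mean(X, y):
    m = sum(y) / len(y)
    return lambda x: m

def fit_linear(X, y):
    # Least squares on the first covariate only.
    xs = [row[0] for row in X]
    mx, my = sum(xs) / len(xs), sum(y) / len(y)
    sxx = sum((a - mx) ** 2 for a in xs)
    slope = sum((a - mx) * (b - my) for a, b in zip(xs, y)) / sxx if sxx else 0.0
    return lambda x: my + slope * (x[0] - mx)

def fit_knn(X, y, k=15):
    def predict(x):
        nearest = sorted(range(len(X)), key=lambda i: abs(X[i][0] - x[0]))[:k]
        return sum(y[i] for i in nearest) / len(nearest)
    return predict

def mse(pred, y):
    return sum((p - t) ** 2 for p, t in zip(pred, y)) / len(y)

def cv_predictions(learners, X, y, n_folds=5):
    # Z[i][j] = prediction for observation i from learner j,
    # fitted without the fold that contains observation i.
    n = len(y)
    Z = [[0.0] * len(learners) for _ in range(n)]
    for fold in range(n_folds):
        train = [i for i in range(n) if i % n_folds != fold]
        test = [i for i in range(n) if i % n_folds == fold]
        X_tr, y_tr = [X[i] for i in train], [y[i] for i in train]
        for j, fit in enumerate(learners):
            f = fit(X_tr, y_tr)
            for i in test:
                Z[i][j] = f(X[i])
    return Z

def convex_weights(Z, y, ticks=20):
    # Coarse grid search over the simplex (weights >= 0, summing to 1)
    # for the combination with the smallest cross-validated MSE.
    m = len(Z[0])
    best_err, best_w = None, None
    for combo in product(range(ticks + 1), repeat=m - 1):
        if sum(combo) > ticks:
            continue
        w = [c / ticks for c in combo] + [(ticks - sum(combo)) / ticks]
        err = mse([sum(wj * zj for wj, zj in zip(w, row)) for row in Z], y)
        if best_err is None or err < best_err:
            best_err, best_w = err, w
    return best_w, best_err

# Simulated data (an assumption): one covariate, binary outcome.
random.seed(0)
X = [[random.uniform(0, 1)] for _ in range(200)]
y = [1.0 if random.random() < 0.1 + 0.7 * math.sin(math.pi * x[0]) else 0.0
     for x in X]

learners = [fit_mean, fit_linear, fit_knn]
Z = cv_predictions(learners, X, y)
cv_mse_single = [mse([row[j] for row in Z], y) for j in range(len(learners))]
weights, cv_mse_ensemble = convex_weights(Z, y)

# Final prediction function: refit each learner on all data, then combine.
fitted = [fit(X, y) for fit in learners]
def super_learner(x):
    return sum(w * f(x) for w, f in zip(weights, fitted))

print("CV MSE per learner:", [round(e, 4) for e in cv_mse_single])
print("weights:", weights, "ensemble CV MSE:", round(cv_mse_ensemble, 4))
```

Because the grid includes the weight vectors that put all mass on one learner, the ensemble's cross-validated MSE in this sketch can never exceed that of the best single candidate. That figure is optimistic, since the weights are tuned on the same cross-validated predictions; an honest performance estimate would need an additional outer layer of cross-validation.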
