Prediction of the 1-Year Risk of Incident Lung Cancer: Prospective Study Using Electronic Health Records from the State of Maine

Background: Lung cancer is the leading cause of cancer death worldwide. Early detection of individuals at risk of lung cancer is critical to reducing the mortality rate.

Objective: The aim of this study was to develop and validate a prospective risk prediction model to identify patients in the general population at risk of new incident lung cancer within the next year.

Methods: Data from individual patient electronic health records (EHRs) were extracted from the Maine Health Information Exchange network. The study population consisted of patients with at least one EHR between April 1, 2016, and March 31, 2018, who had no history of lung cancer. A retrospective cohort (N=873,598) and a prospective cohort (N=836,659) were formed for model construction and validation, respectively. An Extreme Gradient Boosting (XGBoost) algorithm was adopted to build the model. It assigned a score to each individual to quantify the probability of a new incident lung cancer diagnosis from October 1, 2016, to September 30, 2017. The model was trained with the clinical profile in the retrospective cohort from the preceding 6 months and validated with the prospective cohort to predict the risk of incident lung cancer from April 1, 2017, to March 31, 2018.

Results: The model had an area under the curve (AUC) of 0.881 (95% CI 0.873-0.889) in the prospective cohort. Two thresholds, 0.0045 and 0.01, were applied to the predictive scores to stratify the population into low-, medium-, and high-risk categories. The incidence of lung cancer in the high-risk category (579/53,922, 1.07%) was 7.7 times that in the overall cohort (1167/836,659, 0.14%). Age, a history of pulmonary diseases and other chronic diseases, medications for mental disorders, and social disparities were found to be associated with new incident lung cancer.

Conclusions: We retrospectively developed and prospectively validated an accurate risk prediction model of new incident lung cancer occurring within the next year. Through statistical learning from the statewide EHR data in the preceding 6 months, our model was able to identify high-risk patients statewide, which will benefit population health through the establishment of preventive interventions or more intensive surveillance.
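The stratification and the incidence comparison reported in the Results can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `risk_category` and `incidence` are hypothetical, and assigning a score that falls exactly on a threshold to the higher category is an assumption, since the abstract does not say how boundary scores were handled. Only the two thresholds and the case counts come from the abstract.

```python
from bisect import bisect_right

# Thresholds reported in the abstract.
THRESHOLDS = (0.0045, 0.01)
LABELS = ("low", "medium", "high")


def risk_category(score: float) -> str:
    """Map a predictive score to a risk category.

    Assumption: a score exactly equal to a threshold is placed in the
    higher category; the abstract does not specify boundary handling.
    """
    return LABELS[bisect_right(THRESHOLDS, score)]


def incidence(cases: int, population: int) -> float:
    """Return the proportion of a population with incident lung cancer."""
    return cases / population


# Counts reported for the prospective cohort.
high = incidence(579, 53_922)
overall = incidence(1_167, 836_659)
ratio = high / overall

print(f"high-risk incidence: {high:.2%}")     # 1.07%
print(f"overall incidence:   {overall:.2%}")  # 0.14%
print(f"ratio: {ratio:.1f}")                  # 7.7
```

For example, `risk_category(0.02)` returns `"high"` and `risk_category(0.001)` returns `"low"`. The ratio computed from the raw counts reproduces the 7.7-fold figure given in the Results.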
