Utility of machine learning in developing a predictive model for early-age-onset colorectal neoplasia using electronic health records

Background and aims The incidence of colorectal cancer (CRC) is increasing in adults younger than 50, and early screening remains challenging due to cost and under-utilization. To identify individuals aged 35–50 years who may benefit from early screening, we developed a prediction model using machine learning and electronic health record (EHR)-derived factors.
Methods We enrolled 3,116 adults aged 35–50 at average risk for CRC who underwent colonoscopy between 2017 and 2020 at a single center. Prediction outcomes were (1) CRC and (2) CRC or high-risk polyps. We derived our predictors from EHRs (e.g., demographics, obesity, laboratory values, medications, and zip code-derived factors). Using a training set (a random sample of 70% of participants), we constructed four machine learning-based models: regularized discriminant analysis, random forest, neural network, and gradient boosting decision tree. In the testing set (the remaining 30% of participants), we measured predictive performance by comparing each model's C-statistic to that of a reference model (logistic regression).
Results The study sample was 55.1% female and 32.8% non-white, and included 16 (0.5%) CRC cases and 478 (15.3%) cases of CRC or high-risk polyps. All machine learning models predicted CRC with higher discriminative ability than the reference model [e.g., C-statistics (95%CI); neural network: 0.75 (0.48–1.00) vs. reference: 0.43 (0.18–0.67); P = 0.07]. Furthermore, all machine learning approaches, except for gradient boosting, predicted CRC or high-risk polyps significantly better than the reference model [e.g., C-statistics (95%CI); regularized discriminant analysis: 0.64 (0.59–0.69) vs. reference: 0.55 (0.50–0.59); P<0.0015]. The most important predictive variables in the regularized discriminant analysis model for CRC or high-risk polyps were income per zip code, the colonoscopy indication, and body mass index quartiles.
Discussion Machine learning applied to EHR data can predict CRC risk in adults aged 35–50 with better discrimination than a logistic regression reference model. Further development of our model is needed, followed by validation in a primary-care setting, before clinical application.
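
The evaluation described in Methods (a random 70/30 split, then a C-statistic with a 95% CI on the testing set) can be illustrated with a minimal, standard-library sketch. The abstract does not state how the confidence intervals were computed, so the percentile bootstrap used here is an assumption, and the function names `split_train_test`, `c_statistic`, and `bootstrap_ci` are hypothetical helpers rather than the authors' code. Model fitting and the between-model P-value are left out.

```python
import random


def split_train_test(n, train_frac=0.7, seed=0):
    """Randomly split indices 0..n-1 into training and testing sets."""
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    cut = int(round(train_frac * n))
    return idx[:cut], idx[cut:]


def c_statistic(y_true, y_score):
    """C-statistic: chance that a random case outscores a random non-case.

    Ties between a case and a non-case count as one half.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one case and one non-case")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def bootstrap_ci(y_true, y_score, n_boot=500, seed=0):
    """Percentile-bootstrap 95% CI for the C-statistic (assumed method)."""
    c_statistic(y_true, y_score)  # raises early if only one class is present
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    while len(stats) < n_boot:
        sample = [rng.randrange(n) for _ in range(n)]
        ys = [y_true[i] for i in sample]
        if 0 < sum(ys) < n:  # skip resamples that contain a single class
            stats.append(c_statistic(ys, [y_score[i] for i in sample]))
    stats.sort()
    return stats[int(0.025 * n_boot)], stats[int(0.975 * n_boot) - 1]


# Toy usage with made-up labels and scores (not study data).
train_idx, test_idx = split_train_test(3116)
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
print(len(train_idx), len(test_idx))        # 2181 935
print(c_statistic(toy_labels, toy_scores))  # 0.75
```

With only 16 CRC cases in the whole cohort, a 30% testing set holds very few events, which is consistent with the wide intervals reported for the CRC outcome.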
