Breast Cancer Risk Prediction Using Electronic Health Records

Electronic health records (EHRs) are an underused data source with great research and clinical potential. Our goal was to quantify the value of EHRs in breast cancer risk prediction. We conducted a retrospective case-control study, gathering patients' ICD-9 diagnosis codes from an existing EHR data repository. Because ICD-9 codes are hierarchical and composed of 3-5 digits, we studied three levels of data representation: level 0, using only the first 3 digits of each code; level 1, using up to the first 4 digits; and level 2, using up to the full 5 digits. We built two models, logistic regression (LR) and LASSO logistic regression (LR+Lasso), to predict breast cancer one year in advance from diagnosis codes at each of the three representation levels. Area under the ROC curve (AUC) was used to assess model performance. With the level 2 representation, the LR+Lasso model achieved a significantly higher AUC than the LR model (0.648 vs 0.603, p=0.013). With the level 1 and level 0 representations, the difference between the LR+Lasso and LR models was not significant (0.634 vs 0.604, p=0.081, and 0.612 vs 0.603, p=0.523, respectively). For the LR model, predictive performance changed only modestly across the three levels. For the LR+Lasso model, performance also changed modestly from the level 0 to the level 1 representation (p=0.168) and from the level 1 to the level 2 representation (p=0.374). However, the level 2 representation provided significantly higher predictive performance than the level 0 representation (p=0.034). The unabridged level 2 representation of the diagnosis codes therefore appears to carry the most valuable information for breast cancer risk prediction. The performance of these models demonstrates that EHR data can be used to predict breast cancer risk, which opens the possibility of personalizing care in clinical practice.
In the future, we will combine coded EHR data with demographic risk factors, genetic variants, and imaging features to improve breast cancer risk prediction.
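
The three representation levels described above can be illustrated with a minimal Python sketch. It is not the authors' code. It assumes codes are stored in the usual dotted form (for example "250.01"), and the function names `truncate_icd9` and `patient_features` are hypothetical. Level 0 keeps only the part before the dot, level 1 adds one further digit, and level 2 adds up to two.

```python
def truncate_icd9(code: str, level: int) -> str:
    """Truncate a dotted ICD-9 code to representation level 0, 1, or 2.

    Assumption: the code is written with a decimal point, e.g. "250.01".
    Level 0 keeps the category before the dot; levels 1 and 2 keep
    one or two digits after it, when the code has them.
    """
    if level not in (0, 1, 2):
        raise ValueError("level must be 0, 1, or 2")
    category, _, detail = code.strip().upper().partition(".")
    detail = detail[:level]  # at most `level` digits after the dot
    return f"{category}.{detail}" if detail else category


def patient_features(codes, level):
    """Return the set of distinct truncated codes for one patient.

    Each element can serve as a binary (present/absent) feature.
    """
    return {truncate_icd9(code, level) for code in codes}


if __name__ == "__main__":
    history = ["250.01", "250.02", "174.9"]  # made-up example codes
    for level in (0, 1, 2):
        print(level, sorted(patient_features(history, level)))
```

Coarser levels merge related codes into one feature, so the feature space shrinks from level 2 to level 0. This is one plausible reading of why a sparsity-inducing penalty such as the lasso matters most at level 2, where there are the most candidate features.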
