An Improved Training Algorithm Based on Ensemble Penalized Cox Regression for Predicting Absolute Cancer Risk

Introduction: Biases in cancer incidence characteristics have led to significant imbalances in the databases constructed by prospective cohort studies. Because they are trained on these imbalanced databases, many traditional algorithms for building cancer risk prediction models perform poorly.

Methods: To improve prediction performance, we incorporated a bagging ensemble framework into an absolute risk model based on penalized Cox regression, yielding an ensemble penalized Cox regression (EPCR) procedure. We then tested whether the EPCR model outperformed traditional regression models on simulated data with varying censoring rates.

Results: Six different simulation studies were performed with 100 replicates. To assess model performance, we calculated the mean false discovery rate (FDR), false omission rate (FOR), true positive rate (TPR), true negative rate (TNR), and area under the receiver operating characteristic curve (AUC). We found that the EPCR procedure reduced the FDR for important variables at the same TPR, thereby achieving more accurate variable screening. In addition, we used the EPCR procedure to build a breast cancer risk prediction model based on the Breast Cancer Cohort Study in Chinese Women database. The AUCs for 3- and 5-year predictions were 0.691 and 0.642, improvements of 0.189 and 0.117, respectively, over the classical Gail model.

Discussion: We conclude that the EPCR procedure can overcome challenges posed by imbalanced data and improve the performance of cancer risk assessment tools.
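The abstract does not give implementation details, so the following is only a minimal sketch of the general idea of bagging a penalized Cox regression for variable screening, not the authors' EPCR algorithm. Everything in it is an assumption made for illustration: the exponential baseline hazard, the lasso penalty fitted by proximal gradient descent, the approximate handling of tied times, the selection-frequency threshold, and all function and parameter names (`simulate_cox`, `lasso_cox`, `bagged_selection`, `selection_metrics`, `lam`, `threshold`).

```python
import numpy as np

def simulate_cox(n, beta, censor_scale, rng, h0=0.1):
    """Simulate right-censored data from a Cox model with a constant
    baseline hazard h0 (assumed), via T = -log(U) / (h0 * exp(x'beta))."""
    X = rng.standard_normal((n, beta.size))
    u = 1.0 - rng.uniform(size=n)                  # uniform on (0, 1]
    t_event = -np.log(u) / (h0 * np.exp(X @ beta))
    t_censor = rng.exponential(censor_scale, size=n)
    return X, np.minimum(t_event, t_censor), (t_event <= t_censor).astype(float)

def neg_pl_gradient(beta, X, event):
    """Gradient of the negative Cox log partial likelihood divided by n.
    Rows must be sorted by decreasing time so the risk set of row i is rows 0..i."""
    eta = X @ beta
    w = np.exp(eta - eta.max())                    # rescaling cancels in s1 / s0
    s0 = np.cumsum(w)
    s1 = np.cumsum(w[:, None] * X, axis=0)
    return -(event[:, None] * (X - s1 / s0[:, None])).sum(axis=0) / len(event)

def lasso_cox(X, time, event, lam, step=0.2, n_iter=300):
    """L1-penalized Cox fit by proximal gradient descent (illustrative solver).
    Tied times are broken by sort order, which is only an approximation."""
    order = np.argsort(-time)
    X, event = X[order], event[order]
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        z = beta - step * neg_pl_gradient(beta, X, event)
        beta = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # soft threshold
    return beta

def bagged_selection(X, time, event, lam, n_boot, threshold, rng):
    """Refit on bootstrap resamples and keep variables whose selection
    frequency reaches the (assumed) threshold."""
    n, p = X.shape
    counts = np.zeros(p)
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)           # sample rows with replacement
        counts += lasso_cox(X[idx], time[idx], event[idx], lam) != 0
    freq = counts / n_boot
    return freq >= threshold, freq

def selection_metrics(selected, truth):
    """FDR, FOR, TPR and TNR of a selected set against the true support."""
    tp = np.sum(selected & truth); fp = np.sum(selected & ~truth)
    fn = np.sum(~selected & truth); tn = np.sum(~selected & ~truth)
    ratio = lambda a, b: float(a) / float(b) if b else 0.0
    return {"FDR": ratio(fp, tp + fp), "FOR": ratio(fn, fn + tn),
            "TPR": ratio(tp, tp + fn), "TNR": ratio(tn, tn + fp)}

rng = np.random.default_rng(0)
beta_true = np.array([1.0, -1.0, 0.0, 0.0, 0.0, 0.0])
X, time, event = simulate_cox(200, beta_true, censor_scale=10.0, rng=rng)
selected, freq = bagged_selection(X, time, event, lam=0.05, n_boot=20,
                                  threshold=0.8, rng=rng)
print(selection_metrics(selected, beta_true != 0))
```

In this sketch the censoring rate can be varied through `censor_scale` (smaller values censor more observations), and the four screening metrics are computed against the known nonzero coefficients, which is only possible in simulation.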
