Predicting Breast Cancer in Chinese Women Using Machine Learning Techniques: Algorithm Development

Background: Risk-based breast cancer screening is a cost-effective intervention for controlling breast cancer in China, but successful implementation of such an intervention requires an accurate breast cancer prediction model for Chinese women.

Objective: This study aimed to evaluate and compare the performance of four machine learning algorithms in predicting breast cancer among Chinese women using 10 breast cancer risk factors.

Methods: A dataset of 7127 breast cancer cases and 7127 matched healthy controls was used for model training and testing. We used repeated 5-fold cross-validation and calculated the area under the receiver operating characteristic curve (AUC), sensitivity, specificity, and accuracy as measures of model performance.

Results: The three novel machine learning algorithms (XGBoost, random forest, and deep neural network) all achieved significantly higher AUC, sensitivity, and accuracy than logistic regression. Among the three, XGBoost (AUC 0.742) outperformed the deep neural network (AUC 0.728) and random forest (AUC 0.728). Main residence, number of live births, menopause status, age, and age at first birth were top-ranked variables in all three novel machine learning algorithms.

Conclusions: The novel machine learning algorithms, especially XGBoost, can be used to develop breast cancer prediction models that help identify women at high risk for breast cancer in developing countries.
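The evaluation protocol named in the Methods (repeated 5-fold cross-validation scored by AUC, sensitivity, specificity, and accuracy) can be illustrated with a minimal standard-library sketch. This is not the study's code. The function names, the stratified splitting, the 0.5 decision threshold, the default of 10 repeats, and the `fit_predict` callable are all assumptions made for illustration; the abstract does not state them.

```python
import random
from statistics import mean


def auc(y_true, scores):
    """AUC as the fraction of (case, control) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def confusion_metrics(y_true, scores, threshold=0.5):
    """Sensitivity, specificity, and accuracy at an assumed decision threshold."""
    pred = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(y_true, pred) if y == 1 and p == 1)
    tn = sum(1 for y, p in zip(y_true, pred) if y == 0 and p == 0)
    fp = sum(1 for y, p in zip(y_true, pred) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(y_true, pred) if y == 1 and p == 0)
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / len(y_true),
    }


def repeated_stratified_kfold(y, k=5, repeats=10, seed=0):
    """Yield (train, test) index lists; each repeat reshuffles within each class."""
    rng = random.Random(seed)
    for _ in range(repeats):
        folds = [[] for _ in range(k)]
        for label in (0, 1):
            members = [i for i, v in enumerate(y) if v == label]
            rng.shuffle(members)
            for j, i in enumerate(members):
                folds[j % k].append(i)  # deal each class round-robin across folds
        for f in range(k):
            train = [i for g in range(k) if g != f for i in folds[g]]
            yield train, folds[f]


def cross_validate(X, y, fit_predict, k=5, repeats=10, seed=0):
    """Average the four metrics over all k * repeats held-out folds.

    fit_predict(X_train, y_train, X_test) is a hypothetical callable that
    trains any model and returns one risk score per test row.
    """
    rows = []
    for train, test in repeated_stratified_kfold(y, k, repeats, seed):
        scores = fit_predict(
            [X[i] for i in train], [y[i] for i in train], [X[i] for i in test]
        )
        y_test = [y[i] for i in test]
        metrics = confusion_metrics(y_test, scores)
        metrics["auc"] = auc(y_test, scores)
        rows.append(metrics)
    return {name: mean(r[name] for r in rows) for name in rows[0]}
```

Any of the four algorithms could be plugged in through `fit_predict`, so that all models are scored on identical folds. Stratifying the folds keeps the 1:1 case-to-control ratio in every held-out set, which is one reasonable choice for a matched dataset but is not confirmed by the abstract.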
