Identifying recurrent breast cancer patients in national health registries using machine learning

Abstract

Background: More than 4500 women are diagnosed with breast cancer each year in Denmark. Despite adequate treatment, 10–30% of patients will experience a recurrence. The Danish Breast Cancer Group (DBCG) stores information on breast cancer recurrence, but automated identification of patients with recurrence is needed to improve data completeness.

Methods: We included patient data from the DBCG, the National Pathology Database, and the National Patient Registry for patients with an invasive breast cancer diagnosis after 1999. In total, relevant features were extracted for 79,483 patients who had undergone definitive surgery. A machine learning (ML) model was trained, using a simplistic encoding scheme of features, on a development sample covering 5333 patients with known recurrence and three times as many women without recurrence. The model was then validated on a separate sample of 1006 patients with unknown recurrence status.

Results: The ML model identified patients with recurrence with an AUC-ROC of 0.93 (95% CI: 0.93–0.94) in the development sample and 0.86 (95% CI: 0.83–0.88) in the validation sample.

Conclusion: An off-the-shelf ML model, trained using the simplistic encoding scheme, could identify patients with recurrence across multiple national registries. This approach could enable researchers and clinicians to identify patients with recurrence faster and more reliably, and reduce manual interpretation of patient data.
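The results are reported as AUC-ROC values with 95% confidence intervals. As an illustration of what such a figure measures, the following is a minimal sketch in plain Python. It computes AUC-ROC as the probability that a randomly chosen patient with recurrence receives a higher model score than a randomly chosen patient without recurrence, and it estimates an interval with a percentile bootstrap. The bootstrap is an assumption made here for illustration; the abstract does not state how the intervals were obtained. The function names `auc_roc` and `bootstrap_ci` and the toy data are hypothetical.

```python
import random


def auc_roc(labels, scores):
    """AUC-ROC via the rank-sum (Mann-Whitney) statistic; tied scores share an average rank."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present")
    pairs = sorted(zip(scores, labels))
    rank_sum_pos = 0.0
    i = 0
    while i < len(pairs):
        # Find the run of equal scores starting at position i.
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of the 1-based ranks i+1 .. j
        rank_sum_pos += avg_rank * sum(label for _, label in pairs[i:j])
        i = j
    u_stat = rank_sum_pos - n_pos * (n_pos + 1) / 2.0
    return u_stat / (n_pos * n_neg)


def bootstrap_ci(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for AUC-ROC (an assumed method, for illustration only)."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        boot_labels = [labels[k] for k in idx]
        # Skip resamples that contain only one class; AUC is undefined there.
        if 0 < sum(boot_labels) < n:
            stats.append(auc_roc(boot_labels, [scores[k] for k in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper


# Toy example: 1 = recurrence, 0 = no recurrence; scores are made-up model outputs.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
print(auc_roc(toy_labels, toy_scores))  # 0.75
```

With real data, `labels` would hold the known recurrence status and `scores` the model's predicted probabilities; the toy values above are not taken from the study.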
