Predicting breast cancer survivability: a comparison of three data mining methods

OBJECTIVE Predicting breast cancer survivability has long been a challenging research problem. Since the early days of this research, related fields have advanced considerably: innovative biomedical technologies allow better explanatory prognostic factors to be measured and recorded; low-cost computer hardware and software allow high-volume, better-quality data to be collected and stored automatically; and better analytical methods allow those voluminous data to be processed effectively and efficiently. The main objective of this manuscript is to report on a research project in which we took advantage of these technological advancements to develop prediction models for breast cancer survivability.

METHODS AND MATERIAL We used two popular data mining algorithms (artificial neural networks and decision trees) along with a commonly used statistical method (logistic regression) to develop the prediction models using a large dataset (more than 200,000 cases). We also used 10-fold cross-validation to obtain an unbiased estimate of each model's performance for comparison purposes.

RESULTS The decision tree (C5) was the best predictor, with 93.6% accuracy on the holdout sample (better than any prediction accuracy reported in the literature); artificial neural networks came second with 91.2% accuracy, and logistic regression came last of the three with 89.2% accuracy.

CONCLUSION The comparative study of multiple prediction models for breast cancer survivability, using a large dataset along with 10-fold cross-validation, gave us insight into the relative predictive ability of different data mining methods. Sensitivity analysis on the neural network models provided the prioritized importance of the prognostic factors used in the study.
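
The 10-fold cross-validation comparison described in the methods can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the data are synthetic, the helper names (`k_fold_indices`, `cross_validated_accuracy`, `fit_majority`, `fit_nearest_centroid`) are hypothetical, and the two toy models are stand-ins rather than the neural network, C5 decision tree, or logistic regression models used in the study.

```python
import random
from collections import Counter


def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle indices 0..n_samples-1 and deal them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validated_accuracy(fit, X, y, k=10, seed=0):
    """Mean held-out accuracy over k folds.

    `fit(X_train, y_train)` must return a function that maps one row to a label.
    """
    folds = k_fold_indices(len(X), k, seed)
    fold_scores = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in range(len(X)) if i not in held]
        # Train only on the other k-1 folds, then score on the held-out fold.
        predict = fit([X[i] for i in train], [y[i] for i in train])
        correct = sum(predict(X[i]) == y[i] for i in held_out)
        fold_scores.append(correct / len(held_out))
    return sum(fold_scores) / len(fold_scores)


def fit_majority(X_train, y_train):
    """Baseline stand-in: always predict the most common training label."""
    majority_label = Counter(y_train).most_common(1)[0][0]
    return lambda row: majority_label


def fit_nearest_centroid(X_train, y_train):
    """Stand-in model: assign a row to the class whose mean of feature 0 is closest."""
    sums, counts = {}, Counter(y_train)
    for row, label in zip(X_train, y_train):
        sums[label] = sums.get(label, 0.0) + row[0]
    centroids = {label: sums[label] / counts[label] for label in sums}
    return lambda row: min(centroids, key=lambda label: abs(row[0] - centroids[label]))


# Synthetic, well-separated one-feature data (hypothetical; not the study's dataset).
X = [[float(v)] for v in range(50)] + [[float(v)] for v in range(100, 150)]
y = [0] * 50 + [1] * 50

for name, fit in [("majority", fit_majority), ("nearest centroid", fit_nearest_centroid)]:
    print(name, cross_validated_accuracy(fit, X, y, k=10))
```

Because every model is scored on the same folds and never on cases it was trained on, the averaged accuracies are directly comparable, which is the property the comparison above relies on.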
