Predicting Metastasis in Breast Cancer: Comparing a Decision Tree with Domain Experts

Breast malignancy is the second most common cause of cancer death among women in Western countries. Identifying high-risk patients is vital so that they can receive specialized treatment. Where access to experienced oncologists is not possible, decision support methods can help predict the recurrence of cancer. A total of 3,699 breast cancer patients admitted in south-east Sweden from 1986 to 1995 were studied. A decision tree was trained on all patients except 100 held-out cases and tested on those 100 cases. Two domain experts were asked to estimate the probability of recurrence for the same 100 patients. ROC curves, areas under the ROC curves, and calibration of the predictions were computed and compared. No significant differences were found between the predictions of the model built by data mining and those of the two domain experts. In situations where experienced oncologists are not available, predictive models created with data mining techniques can support physicians in decision making with acceptable accuracy.
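The comparison described above rests on two quantities: the area under the ROC curve (discrimination) and calibration (agreement between predicted probabilities and observed outcome rates). The abstract does not say how these were computed, so the following is only a minimal sketch under stated assumptions: the AUC is taken as the pairwise (Mann-Whitney) estimate, calibration is summarized by equal-width probability bins, and the names `auc`, `calibration_table`, `outcome`, `tree_prob` and `expert_prob` as well as all numbers are hypothetical toy values, not data from the study.

```python
from typing import List, Sequence, Tuple


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Pairwise (Mann-Whitney) estimate of the area under the ROC curve.

    Counts how often a case with recurrence (label 1) receives a higher
    score than a case without recurrence (label 0); ties count as 0.5.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def calibration_table(
    labels: Sequence[int], scores: Sequence[float], n_bins: int = 2
) -> List[Tuple[float, float, int]]:
    """Return (mean predicted probability, observed rate, count) per bin.

    Bins are equal-width intervals over [0, 1]; empty bins are skipped.
    """
    bins: List[List[Tuple[int, float]]] = [[] for _ in range(n_bins)]
    for y, s in zip(labels, scores):
        idx = min(int(s * n_bins), n_bins - 1)  # a score of 1.0 goes in the last bin
        bins[idx].append((y, s))
    rows = []
    for b in bins:
        if b:
            mean_pred = sum(s for _, s in b) / len(b)
            observed = sum(y for y, _ in b) / len(b)
            rows.append((mean_pred, observed, len(b)))
    return rows


# Hypothetical toy data: 1 = recurrence, 0 = no recurrence.
outcome = [1, 1, 1, 0, 0, 0, 0, 1]
tree_prob = [0.9, 0.8, 0.4, 0.3, 0.2, 0.6, 0.1, 0.7]    # assumed model output
expert_prob = [0.8, 0.6, 0.5, 0.4, 0.1, 0.5, 0.3, 0.9]  # assumed expert estimates

if __name__ == "__main__":
    print("AUC, decision tree:", auc(outcome, tree_prob))
    print("AUC, expert:", auc(outcome, expert_prob))
    for mean_pred, observed, count in calibration_table(outcome, tree_prob):
        print(f"predicted {mean_pred:.2f}  observed {observed:.2f}  n={count}")
```

On a held-out set of 100 cases, the same two functions would be applied once to the model's predicted probabilities and once to each expert's estimates. Deciding whether the resulting AUC values differ significantly requires a separate statistical test, which this sketch does not cover.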
