Colon cancer survival prediction using ensemble data mining on SEER data

We analyze the colon cancer data available from the SEER program with the aim of developing accurate survival prediction models for colon cancer. After carefully designed preprocessing steps that removed several attributes, we apply several supervised classification methods. We also adopt the synthetic minority over-sampling technique (SMOTE) to balance the survival and non-survival classes. In our experiments, ensemble voting of the three top-performing classifiers gave the best prediction performance in terms of both prediction accuracy and area under the ROC curve. We evaluated multiple classification schemes to estimate the risk of mortality 1 year, 2 years, and 5 years after diagnosis on three attribute sets: a subset of 65 attributes retained after the data cleanup process, 13 attributes chosen with attribute selection techniques so as to retain the predictive power of the original set, and a SMOTE-balanced set of the same 13 attributes. Moreover, we demonstrate the importance of balancing the classes of the data set to yield better results.
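The two techniques named above, SMOTE balancing and ensemble voting, can be illustrated with a minimal sketch. It is not the pipeline used in this work: the helper names `smote` and `soft_vote`, the toy data, the neighbour count `k`, and the choice of averaging class probabilities as the voting rule are all assumptions made for illustration.

```python
import random

def smote(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points, each interpolated between a minority
    sample and one of its k nearest minority neighbours (hypothetical helper)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance to base.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic

def soft_vote(probabilities):
    """Average the positive-class probabilities of several classifiers and
    threshold the mean at 0.5 (an assumed combination rule)."""
    mean = sum(probabilities) / len(probabilities)
    return (1 if mean >= 0.5 else 0), mean

# Toy minority class with two numeric attributes (made-up values).
minority = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]]
new_points = smote(minority, n_new=6)

# Made-up probabilities from three classifiers for a single patient record.
label, score = soft_vote([0.9, 0.6, 0.3])
```

Because each synthetic point lies on a segment between two real minority samples, the oversampled class stays inside the region already occupied by that class rather than duplicating records exactly.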
