Colon cancer survival prediction using ensemble data mining on SEER data

We analyze the lung cancer data available from the SEER program with the aim of developing accurate survival prediction models for lung cancer. Carefully designed preprocessing steps resulted in the removal, modification, or splitting of several attributes, and 2 of the 11 derived attributes were found to have significant predictive power. Several supervised classification methods were applied to the preprocessed data, along with various data mining optimizations and validations. In our experiments, ensemble voting of five decision-tree-based classifiers and meta-classifiers gave the best prediction performance in terms of accuracy and area under the ROC curve. We have developed an on-line lung cancer outcome calculator that estimates the risk of mortality at 6 months, 9 months, 1 year, 2 years, and 5 years after diagnosis. For the calculator, a smaller non-redundant subset of 13 attributes was selected using attribute selection techniques, while trying to retain the predictive power of the original set of attributes. Ensemble voting models were also created for conditional survival, that is, for estimating the risk of mortality within 5 years of diagnosis given that the patient has already survived for a period of time; these models are included in the calculator as well. The on-line lung cancer outcome calculator developed as a result of this study is available at http://info.eecs.northwestern.edu:8080/LungCancerOutcomeCalculator/.
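To make the idea of ensemble voting concrete, the following is a minimal sketch in Python, not the implementation used in the study. It assumes that each member classifier outputs a mortality probability and that the ensemble combines them by simple averaging; the combination rule, the three toy "stump" classifiers, their thresholds, and the attribute order (age, tumor size, stage) are all hypothetical and chosen only for illustration.

```python
from typing import Callable, Sequence

# A "classifier" here is any function that maps a feature vector to an
# estimated probability of mortality within the period of interest.
Classifier = Callable[[Sequence[float]], float]


def vote_average(classifiers: Sequence[Classifier], x: Sequence[float]) -> float:
    """Combine member outputs by averaging their probabilities (assumed rule)."""
    if not classifiers:
        raise ValueError("at least one classifier is required")
    return sum(clf(x) for clf in classifiers) / len(classifiers)


def predict(classifiers: Sequence[Classifier], x: Sequence[float],
            threshold: float = 0.5) -> bool:
    """Return True when the averaged mortality risk reaches the threshold."""
    return vote_average(classifiers, x) >= threshold


# Hypothetical one-split decision stumps standing in for trained tree models.
# Feature order (an assumption): x[0] = age, x[1] = tumor size, x[2] = stage.
def stump_age(x: Sequence[float]) -> float:
    return 0.8 if x[0] >= 70 else 0.3


def stump_size(x: Sequence[float]) -> float:
    return 0.9 if x[1] >= 50 else 0.4


def stump_stage(x: Sequence[float]) -> float:
    return 0.7 if x[2] >= 3 else 0.2


MEMBERS = [stump_age, stump_size, stump_stage]

if __name__ == "__main__":
    patient = [75, 20, 4]  # made-up values, not SEER data
    print(round(vote_average(MEMBERS, patient), 3))
    print(predict(MEMBERS, patient))
```

Averaging probabilities rather than counting hard votes keeps a continuous risk estimate, which is the kind of output a risk calculator would report; a majority vote over class labels would be an equally valid reading of "ensemble voting".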
