Analysis of cancer data: a data mining approach

Abstract: Although cancer research has traditionally been clinical and biological in nature, data-driven analytic studies have become a common complement in recent years. In medical domains where data- and analytics-driven research is applied successfully, it identifies new research directions that further advance clinical and biological studies. In this research, we used three popular data mining techniques (decision trees, artificial neural networks and support vector machines), along with the most commonly used statistical analysis technique, logistic regression, to develop prediction models for prostate cancer survivability. The data set contained around 120 000 records and 77 variables. A k-fold cross-validation methodology was used in model building, evaluation and comparison. The results showed that support vector machines were the most accurate predictor for this domain (with a test set accuracy of 92.85%), followed by artificial neural networks and decision trees.
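As an illustration of the k-fold cross-validation step mentioned in the abstract, the following is a minimal sketch using only the Python standard library. It is not the authors' implementation: the function names (`k_fold_indices`, `cross_validate`), the choice of k = 10, the majority-class stand-in model and the synthetic data are all assumptions made for the example. In practice the `fit` and `predict` callables would wrap a decision tree, neural network, support vector machine or logistic regression model.

```python
import random
from collections import Counter


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle the indices 0..n_samples-1 and split them into k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index, so fold sizes differ by at most 1.
    return [indices[i::k] for i in range(k)]


def cross_validate(fit, predict, X, y, k=10, seed=0):
    """Return the mean held-out accuracy over k folds (assumes k <= len(X))."""
    folds = k_fold_indices(len(X), k, seed)
    accuracies = []
    for i, test_idx in enumerate(folds):
        # Train on every fold except fold i, then score on fold i.
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        model = fit([X[j] for j in train_idx], [y[j] for j in train_idx])
        correct = sum(predict(model, X[j]) == y[j] for j in test_idx)
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / k


# Hypothetical stand-in model: always predict the most common training label.
def fit_majority(X_train, y_train):
    return Counter(y_train).most_common(1)[0][0]


def predict_majority(model, x):
    return model


if __name__ == "__main__":
    # Synthetic toy data (not the prostate cancer data set): 75% positive labels.
    X = [[i] for i in range(100)]
    y = [1 if i % 4 else 0 for i in range(100)]
    print(cross_validate(fit_majority, predict_majority, X, y, k=10))
```

Running the same `cross_validate` call with identical folds (same `seed`) for each candidate model is what makes the resulting accuracies comparable across techniques.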
