Prediction of human breast and colon cancers from imbalanced data using nearest neighbor and support vector machines

This study proposes a novel approach for predicting human breast and colon cancers using different feature spaces. The proposed scheme consists of two stages: a preprocessor and a predictor. In the preprocessor stage, the mega-trend diffusion (MTD) technique is employed to generate additional samples of the minority class, thereby balancing the dataset. In the predictor stage, K-nearest neighbor (KNN) and support vector machine (SVM) classifiers are trained on the balanced data to obtain the hybrid MTD-KNN and MTD-SVM prediction models. The MTD-SVM model provided the best values of accuracy, G-mean and Matthews correlation coefficient, of 96.71%, 96.70% and 71.98% for the cancer/non-cancer, breast/non-breast cancer and colon/non-colon cancer datasets, respectively. We found that the hybrid MTD-SVM model is the best in terms of both prediction performance and computational cost. The MTD-KNN model achieved moderately better prediction than the hybrid MTD-NB (Naïve Bayes) model, but at a higher computational cost. The MTD-KNN model is faster than MTD-RF (random forest), but its prediction performance is no better than that of MTD-RF. To the best of our knowledge, these are the best results reported so far for these datasets. The results indicate that the developed models can be used as a tool for cancer prediction. The scheme may also be useful for studying any sequential information, such as protein or nucleic acid sequences.
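The three reported measures can be illustrated with a minimal Python sketch. It uses the standard textbook definitions of accuracy, G-mean (the geometric mean of sensitivity and specificity) and the Matthews correlation coefficient; the function names and the toy labels are assumptions made for illustration and are not taken from the study's code or data.

```python
import math


def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def accuracy(tp, fp, tn, fn):
    """Fraction of all samples classified correctly."""
    total = tp + fp + tn + fn
    return (tp + tn) / total if total else 0.0


def g_mean(tp, fp, tn, fn):
    """Geometric mean of sensitivity and specificity."""
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return math.sqrt(sensitivity * specificity)


def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient; returns 0.0 when undefined."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


if __name__ == "__main__":
    # Hypothetical imbalanced toy set: 10 positive and 90 negative samples.
    y_true = [1] * 10 + [0] * 90
    # A trivial classifier that always predicts the majority class.
    y_pred = [0] * 100
    counts = confusion_counts(y_true, y_pred)
    print("accuracy:", accuracy(*counts))
    print("G-mean:  ", g_mean(*counts))
    print("MCC:     ", mcc(*counts))
```

On this toy set the majority-class predictor scores high on accuracy while G-mean and MCC both fall to zero, which is why class-sensitive measures are reported alongside accuracy for imbalanced data.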
