An entropy-based classification of breast cancerous genes using microarray data

Gene expression levels obtained from microarray data offer a promising basis for classifying cancerous tissue. Because microarray datasets are high-dimensional, redundant genes must be removed so that only the significant genes are used to build the classifier. In this work, an entropy-based method was applied in a supervised learning setting to differentiate between normal tissue and breast tumor tissue on the basis of their gene expression profiles. Four widely used machine learning techniques were employed for breast cancer prediction: support vector machine (SVM), random forest, k-nearest neighbor (KNN) and naive Bayes. Their performance was evaluated with four classification performance measures, and SVM achieved higher accuracy than the other algorithms: a classification accuracy of 91.5% with an F1 measure of 0.833. The techniques were further compared using ROC curves and calibration plots.
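The abstract does not spell out the exact entropy criterion, so the following is only a minimal sketch of one common form of entropy-based gene ranking: information gain of each gene with respect to the class label. The median-split discretization, the function names, and the toy expression values are all assumptions made for illustration and are not taken from the paper.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(values, labels):
    """Information gain of one gene's expression values about the labels.

    The values are split at their median (an assumed discretization;
    the paper may use a different one).
    """
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2:
        threshold = ordered[mid]
    else:
        threshold = (ordered[mid - 1] + ordered[mid]) / 2
    low = [y for x, y in zip(values, labels) if x <= threshold]
    high = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    conditional = (len(low) / n) * entropy(low) + (len(high) / n) * entropy(high)
    return entropy(labels) - conditional


def rank_genes(expression, labels):
    """Return gene names sorted by decreasing information gain.

    `expression` maps a gene name to its per-sample expression values.
    """
    gains = {gene: information_gain(vals, labels) for gene, vals in expression.items()}
    return sorted(gains, key=gains.get, reverse=True)


# Toy data (hypothetical): gene_a separates the classes, gene_b barely does.
labels = ["tumor", "tumor", "tumor", "normal", "normal", "normal"]
expression = {
    "gene_a": [8.1, 7.9, 8.4, 2.0, 2.3, 1.8],
    "gene_b": [5.0, 2.1, 4.9, 5.1, 2.0, 4.8],
}
ranking = rank_genes(expression, labels)
```

In a pipeline of this kind, only the top-ranked genes would then be passed to the classifiers (SVM, random forest, KNN, naive Bayes), which reduces dimensionality before training.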

[1]  Md Masud Rana,et al.  Robustification of Naïve Bayes Classifier and Its Application for Microarray Gene Expression Data Analysis , 2017, BioMed research international.

[2]  Kassandra I. Alcaraz,et al.  Cancer statistics for African Americans, 2016: Progress and opportunities in reducing racial disparities , 2016, CA: a cancer journal for clinicians.

[3]  Doulaye Dembélé,et al.  Fuzzy C-means Method for Clustering Microarray Data , 2003, Bioinform..

[4]  Enrico Smeraldi,et al.  Neural network analysis in pharmacogenetics of mood disorders , 2004, BMC Medical Genetics.

[5]  Robert P. Sheridan,et al.  Random Forest: A Classification and Regression Tool for Compound Classification and QSAR Modeling , 2003, J. Chem. Inf. Comput. Sci..

[6]  C. Devi Arockia Vanitha,et al.  Gene Expression Data Classification Using Support Vector Machine and Mutual Information-based Gene Selection☆ , 2015 .

[7]  Andrew P. Bradley,et al.  The use of the area under the ROC curve in the evaluation of machine learning algorithms , 1997, Pattern Recognit..

[8]  Bruno H Stricker,et al.  Improving lung cancer survival; time to move on , 2012, BMC Pulmonary Medicine.

[9]  Michael I. Jordan,et al.  On Discriminative vs. Generative Classifiers: A comparison of logistic regression and naive Bayes , 2001, NIPS.

[10]  Nir Friedman,et al.  Tissue classification with gene expression profiles , 2000, RECOMB '00.

[11]  C. Furberg,et al.  Effect of drug therapy on survival in chronic congestive heart failure. , 1988, The American journal of cardiology.

[12]  Jiawei Han,et al.  Cancer classification using gene expression data , 2003, Inf. Syst..

[13]  S. Geisser Selecting a statistical model and predicting , 1993 .

[14]  M. Ringnér,et al.  Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks , 2001, Nature Medicine.

[15]  Wei Xie,et al.  Accurate Cancer Classification Using Expressions of Very Few Genes , 2007, IEEE/ACM Transactions on Computational Biology and Bioinformatics.

[16]  Ronald W. Davis,et al.  Quantitative Monitoring of Gene Expression Patterns with a Complementary DNA Microarray , 1995, Science.

[17]  Ka Yee Yeung,et al.  Validating clustering for gene expression data , 2001, Bioinform..

[18]  Verónica Bolón-Canedo,et al.  An ensemble of filters and classifiers for microarray data classification , 2012, Pattern Recognit..

[19]  Syed Mohsin,et al.  Gene expression profiling for the prediction of therapeutic response to docetaxel in patients with breast cancer , 2003, The Lancet.

[20]  Chhanda Ray,et al.  Cancer Identification and Gene Classification using DNA Microarray Gene Expression Patterns , 2011 .

[21]  Andy Liaw,et al.  Classification and Regression by randomForest , 2007 .

[22]  Ramón Díaz-Uriarte,et al.  Gene selection and classification of microarray data using random forest , 2006, BMC Bioinformatics.

[23]  R. Gelber,et al.  Association of DNA index and S-phase fraction with prognosis of nodes positive early breast cancer. , 1987, Cancer research.

[24]  Thorsten Joachims,et al.  Text categorization with support vector machines , 1999 .

[25]  Aixia Guo,et al.  Gene Selection for Cancer Classification using Support Vector Machines , 2014 .

[26]  Martin Mozina,et al.  Orange: data mining toolbox in python , 2013, J. Mach. Learn. Res..

[27]  Thorsten Joachims,et al.  Text Categorization with Support Vector Machines: Learning with Many Relevant Features , 1998, ECML.

[28]  Alok J. Saldanha,et al.  Java Treeview - extensible visualization of microarray data , 2004, Bioinform..

[29]  Nello Cristianini,et al.  Support vector machine classification and validation of cancer tissue samples using microarray expression data , 2000, Bioinform..

[30]  Alfonso Valencia,et al.  A hierarchical unsupervised growing neural network for clustering gene expression patterns , 2001, Bioinform..

[31]  Zhi-Hua Zhou,et al.  ML-KNN: A lazy learning approach to multi-label learning , 2007, Pattern Recognit..

[32]  P. Brown,et al.  Exploring the metabolic and genetic control of gene expression on a genomic scale. , 1997, Science.

[33]  Jingqin Luo,et al.  Microarray data analysis in neoadjuvant biomarker studies in estrogen receptor-positive breast cancer , 2010, Breast Cancer Research.

[34]  J. Mesirov,et al.  Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. , 1999, Science.

[35]  Edward A. Grens,et al.  Evaluation of parameters for nonlinear thermodynamic models , 1978 .

[36]  Todd H. Stokes,et al.  k-Nearest neighbor models for microarray gene expression analysis and clinical outcome prediction , 2010, The Pharmacogenomics Journal.