A comparative study of different machine learning methods on microarray gene expression data

Background: Several classification and feature selection methods have been studied for identifying differentially expressed genes in microarray data. Classification methods such as SVM, RBF neural nets, MLP neural nets, Bayesian, decision tree, and random forest methods have been used in recent studies. The accuracy of these methods has been estimated with validation methods such as v-fold cross-validation. However, there is a lack of comparisons between these methods that could identify a better framework for classification, clustering, and analysis of microarray gene expression results.

Results: In this study, we compared the efficiency of classification methods including SVM, RBF neural nets, MLP neural nets, Bayesian, decision tree, and random forest methods. V-fold cross-validation was used to calculate the accuracy of the classifiers. Common clustering methods, including K-means, DBC, and EM clustering, were applied to the datasets, and their efficiency was analysed. Further, the efficiency of feature selection methods, including support vector machine recursive feature elimination (SVM-RFE), Chi Squared, and CFS, was compared. In each case, these methods were applied to eight different binary (two-class) microarray datasets. We evaluated the class prediction efficiency of each gene list in training and test cross-validation using supervised classifiers.

Conclusions: We presented a study comparing some of the commonly used classification, clustering, and feature selection methods. We applied these methods to eight publicly available datasets and compared how they performed in class prediction of test datasets. We reported that the choice of feature selection method, the number of genes in the gene list, and the number of cases (samples) substantially influence classification success. Based on the features chosen by these methods, error rates and accuracy of several classification algorithms were obtained. Results revealed the importance of feature selection in accurately classifying new samples, and showed how an integrated feature selection and classification algorithm performs and that it is capable of identifying significant genes.
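To make the v-fold cross-validation procedure mentioned above concrete, the following is a minimal sketch in plain Python. It is not the study's implementation: the nearest-centroid classifier, the synthetic two-class "expression" data, and all function names (`v_fold_indices`, `fit_centroids`, `predict`, `cross_val_accuracy`) are assumptions chosen only for illustration. The idea it shows is that each sample is held out exactly once, the classifier is fitted on the remaining folds, and accuracy is the fraction of held-out samples predicted correctly.

```python
import random
from statistics import mean


def v_fold_indices(n_samples, v, seed=0):
    """Shuffle sample indices and split them into v nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::v] for i in range(v)]


def fit_centroids(X, y):
    """Toy classifier (an assumption, not from the study):
    store the mean expression profile of each class."""
    centroids = {}
    for label in set(y):
        rows = [x for x, lab in zip(X, y) if lab == label]
        centroids[label] = [mean(col) for col in zip(*rows)]
    return centroids


def predict(centroids, x):
    """Assign x to the class whose centroid is nearest (squared Euclidean)."""
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda label: dist(centroids[label]))


def cross_val_accuracy(X, y, v=5, seed=0):
    """Hold out each fold once, train on the rest, and pool the hits."""
    folds = v_fold_indices(len(X), v, seed)
    correct = 0
    for test_idx in folds:
        held_out = set(test_idx)
        train_X = [X[i] for i in range(len(X)) if i not in held_out]
        train_y = [y[i] for i in range(len(X)) if i not in held_out]
        model = fit_centroids(train_X, train_y)
        correct += sum(predict(model, X[i]) == y[i] for i in test_idx)
    return correct / len(X)


# Synthetic binary dataset (hypothetical): 20 samples per class, 5 "genes",
# with class 1 shifted upward so the classes are separable.
rng = random.Random(42)
X = [[rng.gauss(0.0, 1.0) for _ in range(5)] for _ in range(20)] + \
    [[rng.gauss(3.0, 1.0) for _ in range(5)] for _ in range(20)]
y = [0] * 20 + [1] * 20

accuracy = cross_val_accuracy(X, y, v=5)
print(f"5-fold cross-validated accuracy: {accuracy:.2f}")
```

When feature selection is combined with a classifier, the selection step would normally be repeated inside each training fold so that the held-out samples do not influence which genes are chosen; the sketch omits feature selection to stay short.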