Ensemble gene selection for cancer classification

Cancer diagnosis is an important emerging clinical application of microarray data. Its accurate prediction to the type or size of tumors relies on adopting powerful and reliable classification models, so as to patients can be provided with better treatment or response to therapy. However, the high dimensionality of microarray data may bring some disadvantages, such as over-fitting, poor performance and low efficiency, to traditional classification models. Thus, one of the challenging tasks in cancer diagnosis is how to identify salient expression genes from thousands of genes in microarray data that can directly contribute to the phenotype or symptom of disease. In this paper, we propose a new ensemble gene selection method (EGS) to choose multiple gene subsets for classification purpose, where the significant degree of gene is measured by conditional mutual information or its normalized form. After different gene subsets have been obtained by setting different starting points of the search procedure, they will be used to train multiple base classifiers and then aggregated into a consensus classifier by the manner of majority voting. The proposed method is compared with five popular gene selection methods on six public microarray datasets and the comparison results show that our method works well.

[1]  Concha Bielza,et al.  Machine Learning in Bioinformatics , 2008, Encyclopedia of Database Systems.

[2]  Hiroshi Motoda,et al.  Computational Methods of Feature Selection , 2022 .

[3]  Seon-Young Kim,et al.  Gene-set approach for expression pattern analysis , 2008, Briefings Bioinform..

[4]  David A. Bell,et al.  A Formalism for Relevance and Its Application in Feature Subset Selection , 2000, Machine Learning.

[5]  Yang Wang,et al.  Attribute Clustering for Grouping, Selection, and Classification of Gene Expression Data , 2005, IEEE ACM Trans. Comput. Biol. Bioinform..

[6]  Zexuan Zhu,et al.  Markov blanket-embedded genetic algorithm for gene selection , 2007, Pattern Recognit..

[7]  Kaushik Mahata,et al.  Selecting differentially expressed genes using minimum probability of classification error , 2007, J. Biomed. Informatics.

[8]  Kari Torkkola,et al.  Information-Theoretic Methods , 2006, Feature Extraction.

[9]  Melanie Hilario,et al.  Approaches to dimensionality reduction in proteomic biomarker studies , 2007, Briefings Bioinform..

[10]  Ramón Díaz-Uriarte,et al.  Gene selection and classification of microarray data using random forest , 2006, BMC Bioinformatics.

[11]  LarrañagaPedro,et al.  A review of feature selection techniques in bioinformatics , 2007 .

[12]  Anil K. Jain,et al.  Statistical Pattern Recognition: A Review , 2000, IEEE Trans. Pattern Anal. Mach. Intell..

[13]  Pedro Larrañaga,et al.  A review of feature selection techniques in bioinformatics , 2007, Bioinform..

[14]  Hong-Wen Deng,et al.  Gene selection for classification of microarray data based on the Bayes error , 2007, BMC Bioinformatics.

[15]  Masoud Nikravesh,et al.  Feature Extraction - Foundations and Applications , 2006, Feature Extraction.

[16]  Mineichi Kudo,et al.  Non-parametric classifier-independent feature selection , 2006, Pattern Recognit..

[17]  Chris H. Q. Ding,et al.  Minimum redundancy feature selection from microarray gene expression data , 2003, Computational Systems Bioinformatics. CSB2003. Proceedings of the 2003 IEEE Bioinformatics Conference. CSB2003.

[18]  Achim Zeileis,et al.  Conditional variable importance for random forests , 2008, BMC Bioinformatics.

[19]  Colas Schretter,et al.  Information-Theoretic Feature Selection in Microarray Data Using Variable Complementarity , 2008, IEEE Journal of Selected Topics in Signal Processing.

[20]  Miguel Cazorla,et al.  Feature selection, mutual information, and the classification of high-dimensional patterns , 2008, Pattern Analysis and Applications.

[21]  Ash A. Alizadeh,et al.  Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling , 2000, Nature.

[22]  Tommy W. S. Chow,et al.  Effective Gene Selection Method With Small Sample Sets Using Gradient-Based and Point Injection Techniques , 2007, IEEE/ACM Transactions on Computational Biology and Bioinformatics.

[23]  Jinn-Yi Yeh,et al.  Applying Data Mining Techniques for Cancer Classification from Gene Expression Data , 2007 .

[24]  Oleg Okun,et al.  Dataset complexity in gene expression based cancer classification using ensembles of k-nearest neighbors , 2009, Artif. Intell. Medicine.

[25]  Chris H. Q. Ding,et al.  Stable feature selection via dense feature groups , 2008, KDD.

[26]  Jan Komorowski,et al.  BIOINFORMATICS ORIGINAL PAPER doi:10.1093/bioinformatics/btm486 Data and text mining Monte Carlo , 2022 .

[27]  Sung-Bae Cho,et al.  An Evolutionary Algorithm Approach to Optimal Ensemble Classifiers for DNA Microarray Data Analysis , 2008, IEEE Transactions on Evolutionary Computation.

[28]  J. Mesirov,et al.  Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. , 1999, Science.

[29]  Jesús S. Aguilar-Ruiz,et al.  Incremental wrapper-based gene selection from microarray data for cancer classification , 2006, Pattern Recognit..

[30]  A. Dupuy,et al.  Critical review of published microarray studies for cancer outcome and guidelines on statistical analysis and reporting. , 2007, Journal of the National Cancer Institute.

[31]  Jianzhong Li,et al.  A stable gene selection in microarray data analysis , 2006, BMC Bioinformatics.

[32]  Xia Li,et al.  Gene mining: a novel and powerful ensemble decision approach to hunting for disease genes using microarray expression profiling. , 2004, Nucleic acids research.

[33]  Mykola Pechenizkiy,et al.  Diversity in search strategies for ensemble feature selection , 2005, Inf. Fusion.

[34]  Kagan Tumer,et al.  Classifier ensembles: Select real-world applications , 2008, Inf. Fusion.

[35]  Constantin F. Aliferis,et al.  A comprehensive comparison of random forests and support vector machines for microarray-based cancer classification , 2008, BMC Bioinformatics.

[36]  Thomas M. Cover,et al.  Elements of Information Theory , 2005 .

[37]  Zne-Jung Lee,et al.  An integrated algorithm for gene selection and classification applied to microarray data of ovarian cancer , 2008, Artif. Intell. Medicine.

[38]  Huan Liu,et al.  Toward integrating feature selection algorithms for classification and clustering , 2005, IEEE Transactions on Knowledge and Data Engineering.

[39]  F. Fleuret Fast Binary Feature Selection with Conditional Mutual Information , 2004, J. Mach. Learn. Res..

[40]  Jiawei Han,et al.  Cancer classification using gene expression data , 2003, Inf. Syst..

[41]  Shimon Ullman,et al.  Learning to classify by ongoing feature selection , 2010, Image Vis. Comput..

[42]  Antanas Verikas,et al.  A feature selection technique for generation of classification committees and its application to categorization of laryngeal images , 2009, Pattern Recognit..

[43]  T. Poggio,et al.  Prediction of central nervous system embryonal tumour outcome based on gene expression , 2002, Nature.

[44]  A.K.C. Wong,et al.  Attribute clustering for grouping, selection, and classification of gene expression data , 2005, IEEE/ACM Transactions on Computational Biology and Bioinformatics.

[45]  Paul A. Viola,et al.  Boosting Image Retrieval , 2000, Proceedings IEEE Conference on Computer Vision and Pattern Recognition. CVPR 2000 (Cat. No.PR00662).

[46]  Lei Yu,et al.  Feature Selection for Genomic Data Analysis , 2007 .

[47]  Shigeo Abe DrEng Pattern Classification , 2001, Springer London.

[48]  Huan Liu Feature Selection , 2010, Encyclopedia of Machine Learning.

[49]  Yvan Saeys,et al.  Robust Feature Selection Using Ensemble Feature Selection Techniques , 2008, ECML/PKDD.

[50]  Sung-Bae Cho,et al.  Ensemble classifiers based on correlation analysis for DNA microarray classification , 2006, Neurocomputing.

[51]  Xin Zhou,et al.  MSVM-RFE: extensions of SVM-RFE for multiclass gene selection on DNA microarray data , 2007, Bioinform..

[52]  Huan Liu,et al.  Efficient Feature Selection via Analysis of Relevance and Redundancy , 2004, J. Mach. Learn. Res..

[53]  Sung-Bae Cho,et al.  The classification of cancer based on DNA microarray data that uses diverse ensemble genetic programming , 2006, Artif. Intell. Medicine.

[54]  Sung-Bae Cho,et al.  Cancer classification using ensemble of neural networks with multiple significant gene subsets , 2007, Applied Intelligence.

[55]  M. Daumer,et al.  Evaluating Microarray-based Classifiers: An Overview , 2008, Cancer informatics.

[56]  Isabelle Guyon,et al.  An Introduction to Variable and Feature Selection , 2003, J. Mach. Learn. Res..

[57]  James J. Chen,et al.  Ensemble methods for classification of patients for personalized medicine with high-dimensional data , 2007, Artif. Intell. Medicine.

[58]  David E. Misek,et al.  Gene-expression profiles predict survival of patients with lung adenocarcinoma , 2002, Nature Medicine.

[59]  G. G. Stokes "J." , 1890, The New Yale Book of Quotations.

[60]  Lei Liu,et al.  Feature selection with dynamic mutual information , 2009, Pattern Recognit..

[61]  Yudong D. He,et al.  Gene expression profiling predicts clinical outcome of breast cancer , 2002, Nature.

[62]  Feng Chu,et al.  A General Wrapper Approach to Selection of Class-Dependent Features , 2008, IEEE Transactions on Neural Networks.

[63]  E. Lander,et al.  Gene expression correlates of clinical prostate cancer behavior. , 2002, Cancer cell.