A novel approach to select important genes from microarray data

Feature subset selection is a well-known pattern recognition problem that aims to reduce the number of features used in classification or recognition. This reduction is expected to improve classification algorithms in terms of speed, accuracy, and simplicity. Most existing feature selection methods are not well suited to microarray data, so this paper focuses on the gene selection problem. The main contributions of this paper are a new feature selection method, A-score, and an improved fuzzy Bayesian classifier. We evaluate the performance of A-score on three well-known benchmark data sets (the iris data, the wine data, and the Wisconsin breast cancer data) and on two microarray data sets (ALL-AML leukemia and colon cancer). In general, A-score significantly reduces the number of genes and performs better than T-score and C-score.
