Minimum number of genes for microarray feature selection

A fundamental problem in microarray analysis is to identify relevant genes from large amounts of expression data. Feature selection aims at identifying a subset of features for building robust learning models. However, finding the optimal number of features is a challenging problem, as it is a trade off between information loss when pruning excessively and noise increase when pruning is too weak. This paper presents a novel representation of genes as strings of bits and a method which automatically selects the minimum number of genes to reach a good classification accuracy on the training set. Our method first eliminates redundant features, which do not add further information for classification, then it exploits a set covering algorithm. Preliminary experimental results on public datasets confirm the intuition of the proposed method leading to high classification accuracy.

[1]  E. Baralis,et al.  The Painter's Feature Selection for Gene Expression Data , 2007, 2007 29th Annual International Conference of the IEEE Engineering in Medicine and Biology Society.

[2]  Yijun Sun,et al.  Iterative RELIEF for Feature Weighting: Algorithms, Theories, and Applications , 2007, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[3]  Ted K. Ralphs,et al.  The Symphony Callable Library for Mixed Integer Programming , 2005 .

[4]  Juan Liu,et al.  A hybrid filter/wrapper gene selection method for microarray classification , 2004, Proceedings of 2004 International Conference on Machine Learning and Cybernetics (IEEE Cat. No.04EX826).

[5]  Wentian Li,et al.  How Many Genes are Needed for a Discriminant Microarray Data Analysis , 2001, physics/0104029.

[6]  Philippe Besse,et al.  Selection of Biologically Relevant Genes with a Wrapper Stochastic Algorithm , 2007, Statistical applications in genetics and molecular biology.

[7]  Vladimir Pavlovic,et al.  RankGene: identification of diagnostic genes based on expression data , 2003, Bioinform..

[8]  Stefan Michiels,et al.  Prediction of cancer outcome with microarrays: a multiple random validation strategy , 2005, The Lancet.

[9]  Constantin F. Aliferis,et al.  A comprehensive evaluation of multicategory classification methods for microarray gene expression cancer diagnosis , 2004, Bioinform..

[10]  Geoff Holmes,et al.  Benchmarking Attribute Selection Techniques for Discrete Class Data Mining , 2003, IEEE Trans. Knowl. Data Eng..

[11]  Todd,et al.  Diffuse large B-cell lymphoma outcome prediction by gene-expression profiling and supervised machine learning , 2002, Nature Medicine.

[12]  S. Merler,et al.  Semisupervised learning for molecular profiling , 2005, IEEE/ACM Transactions on Computational Biology and Bioinformatics.

[13]  Fuhui Long,et al.  Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy , 2003, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[14]  Ian B. Jeffery,et al.  Comparison and evaluation of methods for generating differentially expressed gene lists from microarray data , 2006, BMC Bioinformatics.

[15]  Jan Komorowski,et al.  BIOINFORMATICS ORIGINAL PAPER doi:10.1093/bioinformatics/btm486 Data and text mining Monte Carlo , 2022 .