New feature subset selection procedures for classification of expression profiles

BackgroundMethods for extracting useful information from the datasets produced by microarray experiments are at present of much interest. Here we present new methods for finding gene sets that are well suited for distinguishing experiment classes, such as healthy versus diseased tissues. Our methods are based on evaluating genes in pairs and evaluating how well a pair in combination distinguishes two experiment classes. We tested the ability of our pair-based methods to select gene sets that generalize the differences between experiment classes and compared the performance relative to two standard methods. To assess the ability to generalize class differences, we studied how well the gene sets we select are suited for learning a classifier.ResultsWe show that the gene sets selected by our methods outperform the standard methods, in some cases by a large margin, in terms of cross-validation prediction accuracy of the learned classifier. We show that on two public datasets, accurate diagnoses can be made using only 15-30 genes. Our results have implications for how to select marker genes and how many gene measurements are needed for diagnostic purposes.ConclusionWhen looking for differential expression between experiment classes, it may not be sufficient to look at each gene in a separate universe. Evaluating combinations of genes reveals interesting information that will not be discovered otherwise. Our results show that class prediction can be improved by taking advantage of this extra information.

[1]  N. L. Johnson,et al.  Multivariate Analysis , 1958, Nature.

[2]  Yoshua Bengio,et al.  Pattern Recognition and Neural Networks , 1995 .

[3]  Ron Kohavi,et al.  Wrappers for Feature Subset Selection , 1997, Artif. Intell..

[4]  Vladimir Vapnik,et al.  Statistical learning theory , 1998 .

[5]  J. Mesirov,et al.  Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. , 1999, Science.

[6]  U. Alon,et al.  Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. , 1999, Proceedings of the National Academy of Sciences of the United States of America.

[7]  Sayan Mukherjee,et al.  Feature Selection for SVMs , 2000, NIPS.

[8]  E. Boerwinkle,et al.  Computational methods for gene expression-based tumor classification. , 2000, BioTechniques.

[9]  Nir Friedman,et al.  Tissue classification with gene expression profiles , 2000, RECOMB '00.

[10]  Ash A. Alizadeh,et al.  Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling , 2000, Nature.

[11]  G. Getz,et al.  Coupled two-way clustering analysis of gene microarray data. , 2000, Proceedings of the National Academy of Sciences of the United States of America.

[12]  Inge Jonassen,et al.  J-Express: exploring gene expression data using Java , 2001, Bioinform..

[13]  Michael I. Jordan,et al.  Feature selection for high-dimensional genomic microarray data , 2001, ICML.

[14]  M. Ringnér,et al.  Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks , 2001, Nature Medicine.

[15]  I. Mian,et al.  Identifying marker genes in transcription profiling data using a mixture of feature relevance experts. , 2001, Physiological genomics.

[16]  S. Dudoit,et al.  Comparison of Discrimination Methods for the Classification of Tumors Using Gene Expression Data , 2002 .