A rank sum test method for informative gene discovery

Finding informative genes from microarray data is an important problem in bioinformatics research and applications. Most existing methods rank features according to their discriminative capability and then select a subset of discriminative genes (usually the top k genes). In particular, the t-statistic criterion and its variants have been adopted extensively. Such methods rely on the statistical principle of the t-test, which requires that the data follow a normal distribution. However, according to our investigation, the normality condition often cannot be met in real data sets.

To avoid the normality assumption, in this paper we propose a rank sum test method for informative gene discovery. The method uses a rank-sum statistic as the ranking criterion. Moreover, we propose using the significance level threshold, instead of the number of informative genes, as the parameter. As a parameter, the significance level threshold carries a statistical specification of quality. We follow the Pitman efficiency theory to show that, in theory, the rank sum method is more accurate and more robust than the t-statistic method.

To verify the effectiveness of the rank sum method, we use support vector machines (SVMs) to construct classifiers based on the identified informative genes on two well-known data sets, namely the colon data and the leukemia data. The prediction accuracy reaches 96.2% on the colon data and 100% on the leukemia data. The results are clearly better than those from previous feature ranking methods. Our experiments also verify that using a significance level threshold is more effective than directly specifying an arbitrary k.
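The selection step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`average_ranks`, `rank_sum_p_value`, `select_genes`), the toy data, and the default threshold `alpha=0.01` are all hypothetical, and the p-value uses the standard normal approximation to the Wilcoxon rank-sum statistic without a tie correction, which may differ from the exact statistic used in the paper.

```python
import math

def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def rank_sum_p_value(x, y):
    """Two-sided rank-sum p-value (normal approximation, no tie correction)."""
    n1, n2 = len(x), len(y)
    ranks = average_ranks(list(x) + list(y))
    w = sum(ranks[:n1])                      # rank sum of the first class
    mean_w = n1 * (n1 + n2 + 1) / 2.0        # expectation under the null
    sd_w = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (w - mean_w) / sd_w
    return math.erfc(abs(z) / math.sqrt(2.0))

def select_genes(expr, labels, alpha=0.01):
    """Keep genes whose p-value is at most alpha, most significant first.

    expr: one list of expression values per gene (one value per sample).
    labels: 0 or 1 per sample. alpha is an assumed example threshold.
    """
    selected = []
    for gene, row in enumerate(expr):
        a = [v for v, c in zip(row, labels) if c == 0]
        b = [v for v, c in zip(row, labels) if c == 1]
        p = rank_sum_p_value(a, b)
        if p <= alpha:
            selected.append((gene, p))
    return sorted(selected, key=lambda t: t[1])

# Toy data (made up): 10 samples per class, two genes.
labels = [0] * 10 + [1] * 10
expr = [
    list(range(20)),                                    # gene 0: classes fully separated
    list(range(0, 20, 2)) + list(range(1, 20, 2)),      # gene 1: classes interleaved
]
chosen = select_genes(expr, labels, alpha=0.01)
print([gene for gene, _ in chosen])
```

On this toy input only gene 0 passes the threshold. The number of selected genes is a consequence of the chosen significance level rather than a fixed k, which is the parameterization the abstract argues for.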
