An Improved Method on Wilcoxon Rank Sum Test for Gene Selection from Microarray Experiments

Selecting a small subset of the thousands of genes in microarray data is important for accurate classification of phenotypes. In this paper, we propose a flexible rank-based nonparametric procedure for gene selection from microarray data. For each gene, we propose a statistic for testing whether the area under the receiver operating characteristic curve (AUC) is equal to 0.5, allowing a different variance for each gene. The contribution of this “single gene” statistic is the studentization of the empirical AUC, which takes into account the variance associated with each gene in the experiment. DeLong et al. proposed a nonparametric procedure for calculating a consistent variance estimator of the AUC. We use their variance estimation technique to construct the test statistic, and we focus on the primary step in the gene selection process, namely, the ranking of genes with respect to a statistical measure of differential expression. Two real datasets are analyzed to illustrate the method, and a simulation study is carried out to assess the relative performance of different statistical gene ranking measures. The work also shows how to use the variance information to produce a list of significant targets and to assess differential gene expression between two conditions. The proposed method does not involve complicated formulas and does not require advanced programming skills. We conclude that the proposed method offers a useful analytical tool for identifying differentially expressed genes for further biological and clinical analysis.
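
The studentized AUC described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the variance estimator takes the usual DeLong form built from structural components, and the function names (studentized_auc, rank_genes), the genes-by-samples array layout, and the handling of a zero variance estimate are all hypothetical choices made for this example.

```python
import numpy as np


def studentized_auc(x, y):
    """Studentized empirical AUC for one gene.

    x, y: expression values of the gene under condition 1 and condition 2.
    Returns (auc, var, z), where z = (auc - 0.5) / sqrt(var).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    m, n = x.size, y.size

    # psi[i, j] = 1 if y_j > x_i, 0.5 if tied, 0 otherwise.
    diff = y[None, :] - x[:, None]
    psi = (diff > 0) + 0.5 * (diff == 0)

    # Empirical AUC (the Mann-Whitney statistic divided by m * n).
    auc = psi.mean()

    # DeLong-style structural components, one per sample in each group.
    v10 = psi.mean(axis=1)  # length m
    v01 = psi.mean(axis=0)  # length n

    # Variance estimate: S10 / m + S01 / n, using sample variances.
    var = v10.var(ddof=1) / m + v01.var(ddof=1) / n

    if var == 0.0:
        # Assumption for this sketch: perfect separation gives a signed
        # infinite statistic, and a completely tied gene gives zero.
        z = 0.0 if auc == 0.5 else np.sign(auc - 0.5) * np.inf
    else:
        z = (auc - 0.5) / np.sqrt(var)
    return auc, var, z


def rank_genes(expr, labels):
    """Rank genes by |z|, most differentially expressed first.

    expr: array of shape (genes, samples); labels: 0/1 condition per sample.
    Returns (order, z) where order indexes the rows of expr.
    """
    expr = np.asarray(expr, dtype=float)
    labels = np.asarray(labels)
    z = np.array([
        studentized_auc(row[labels == 0], row[labels == 1])[2]
        for row in expr
    ])
    order = np.argsort(-np.abs(z), kind="stable")
    return order, z


if __name__ == "__main__":
    # Toy data: two genes measured on three samples per condition.
    expr = [[1, 2, 3, 2, 3, 4],
            [1, 5, 3, 2, 4, 3.5]]
    labels = [0, 0, 0, 1, 1, 1]
    order, z = rank_genes(expr, labels)
    print(order, z)
```

Ranking by the absolute value of the statistic treats up-regulation and down-regulation symmetrically; a one-sided ranking would sort on the signed value instead.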

[1] J. Mesirov, et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring, 1999, Science.

[2] Jialiang Li, et al. Weighted area under the receiver operating characteristic curve and its application to gene selection, 2010, Journal of the Royal Statistical Society. Series C, Applied statistics.

[3] M. Zweig, et al. Receiver-operating characteristic (ROC) plots: a fundamental evaluation tool in clinical medicine, 1993, Clinical chemistry.

[4] Rebecca W. Doerge, et al. Gene expression data: The technology and statistical analysis, 2003.

[5] J. Thomas, et al. An efficient and robust statistical modeling approach to discover differentially expressed genes using genomic expression profiles, 2001, Genome research.

[6] J. N. Arvesen. Jackknifing U-statistics, 1968.

[7] M. Mitreva, et al. Alpha-gliadin genes from the A, B, and D genomes of wheat contain different sets of celiac disease epitopes, 2006, BMC Genomics.

[8] I. Hinberg, et al. Receiver operator characteristic (ROC) curves and non-normal data: an empirical study, 1990, Statistics in medicine.

[9] Russ B. Altman, et al. Pattern Recognition of Genomic Features with Microarrays: Site Typing of Mycobacterium Tuberculosis Strains, 2000, ISMB.

[10] Thomas J. Meyer, et al. EGenBio: A Data Management System for Evolutionary Genomics and Biodiversity, 2006, BMC Bioinformatics.

[11] T. Niu, et al. A Statistical Procedure for Detecting Highly Correlated Genes with a Pre-Specified Candidate Gene in Microarray Analysis, 2008.

[12] P. Broberg. Statistical methods for ranking differentially expressed genes, 2003, Genome Biology.

[13] Ian B. Jeffery, et al. Comparison and evaluation of methods for generating differentially expressed gene lists from microarray data, 2006, BMC Bioinformatics.

[14] K. Berbaum, et al. Receiver operating characteristic rating analysis. Generalization to the population of readers and patients with the jackknife method, 1992, Investigative radiology.

[15] Joseph Beyene, et al. Tests for differential gene expression using weights in oligonucleotide microarray experiments, 2006, BMC Genomics.

[16] R. Tibshirani, et al. Significance analysis of microarrays applied to the ionizing radiation response, 2001, Proceedings of the National Academy of Sciences of the United States of America.

[17] J. Hanley, et al. A method of comparing the areas under receiver operating characteristic curves derived from the same cases, 1983, Radiology.

[18] M. Kenward, et al. An Introduction to the Bootstrap, 2007.

[19] Gordon K. Smyth. Linear Models and Empirical Bayes Methods for Assessing Differential Expression in Microarray Experiments, 2004, Statistical applications in genetics and molecular biology.

[20] S. Dudoit, et al. Comparison of Discrimination Methods for the Classification of Tumors Using Gene Expression Data, 2002.

[21] D. Jones, et al. Adjustments and measures of differential expression for microarray data, 2002, Bioinform.

[22] Elizabeth Garrett-Mayer, et al. Cross-study validation and combined analysis of gene expression microarray data, 2007, Biostatistics.

[23] John D. Storey, et al. Empirical Bayes Analysis of a Microarray Experiment, 2001.

[24] Wei Pan. A comparative review of statistical methods for discovering differentially expressed genes in replicated microarray experiments, 2002, Bioinform.

[25] Daniel Q. Naiman, et al. Classifying Gene Expression Profiles from Pairwise mRNA Comparisons, 2004, Statistical applications in genetics and molecular biology.

[26] M. Xiong, et al. Biomarker Identification by Feature Wrappers, 2022.

[27] Russ B. Altman, et al. Nonparametric methods for identifying differentially expressed genes in microarray data, 2002, Bioinform.

[28] W. Hoeffding. A Class of Statistics with Asymptotically Normal Distribution, 1948.

[29] E. DeLong, et al. Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach, 1988, Biometrics.

[30] Wei Pan. Modified Nonparametric Approaches to Detecting Differentially Expressed Genes in Replicated Microarray Experiments, 2003, Bioinform.

[31] M. Schummer, et al. Selecting Differentially Expressed Genes from Microarray Experiments, 2003, Biometrics.

[32] Terence P. Speed, et al. A benchmark for Affymetrix GeneChip expression measures, 2004, Bioinform.