Selection of differentially expressed genes in microarray data analysis

One common objective in microarray experiments is to identify a subset of genes that express differentially among different experimental conditions, for example, between drug treatment and no drug treatment. Often, the goal is to determine the underlying relationship between poor versus good gene signatures for identifying biological functions or predicting specific therapeutic outcomes. Because of the complexity in studying hundreds or thousands of genes in an experiment, selection of a subset of genes to enhance relationships among the underlying biological structures or to improve prediction accuracy of clinical outcomes has been an important issue in microarray data analysis. Selection of differentially expressed genes is a two-step process. The first step is to select an appropriate test statistic and compute the P-value. The genes are ranked according to their P-values as evidence of differential expression. The second step is to assign a significance level, that is, to determine a cutoff threshold from the P-values in accordance with the study objective. In this paper, we consider four commonly used statistics, t-, S- (SAM), U-(Mann–Whitney) and M-statistics to compute the P-values for gene ranking. We consider the family-wise error and false discovery rate false-positive error-controlled procedures to select a limited number of genes, and a receiver-operating characteristic (ROC) approach to select a larger number of genes for assigning the significance level. The ROC approach is particularly useful in genomic/genetic profiling studies. The well-known colon cancer data containing 22 normal and 40 tumor tissues are used to illustrate different gene ranking and significance level assignment methods for applications to genomic/genetic profiling studies. The P-values computed from the t-, U- and M-statistics are very similar. We discuss the common practice that uses the P-value, false-positive error probability, as the primary criterion, and then uses the fold-change as a surrogate measure of biological significance for gene selection. The P-value and the fold-change can be pictorially shown simultaneously in a volcano plot. We also address several issues on gene selection.

[1]  Y. Benjamini,et al.  Controlling the false discovery rate: a practical and powerful approach to multiple testing , 1995 .

[2]  John Quackenbush,et al.  Multiple-laboratory comparison of microarray platforms , 2005, Nature Methods.

[3]  Huey-Miin Hsueh,et al.  Incorporating the number of true null hypotheses to improve power in multiple testing: application to gene microarray data , 2007 .

[4]  R. Simon,et al.  Use of genomic signatures in therapeutics development in oncology and other diseases , 2006, The Pharmacogenomics Journal.

[5]  T. Speed,et al.  Statistical issues in cDNA microarray data analysis. , 2003, Methods in molecular biology.

[6]  James J. Chen,et al.  Multiple‐Testing Strategy for Analyzing cDNA Array Data on Gene Expression , 2004, Biometrics.

[7]  J. Booth,et al.  Resampling-Based Multiple Testing. , 1994 .

[8]  J. Mesirov,et al.  Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. , 1999, Science.

[9]  Russell D. Wolfinger,et al.  The contributions of sex, genotype and age to transcriptional variance in Drosophila melanogaster , 2001, Nature Genetics.

[10]  Yudong D. He,et al.  Gene expression profiling predicts clinical outcome of breast cancer , 2002, Nature.

[11]  U. Alon,et al.  Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. , 1999, Proceedings of the National Academy of Sciences of the United States of America.

[12]  R. Tibshirani,et al.  Significance analysis of microarrays applied to the ionizing radiation response , 2001, Proceedings of the National Academy of Sciences of the United States of America.

[13]  Bryan Frank,et al.  Independence and reproducibility across microarray platforms , 2005, Nature Methods.

[14]  Sue-Jane Wang,et al.  Sample size for gene expression microarray experiments , 2005, Bioinform..

[15]  L. Qin,et al.  Empirical evaluation of data transformations and ranking statistics for microarray analysis. , 2004, Nucleic acids research.

[16]  Huey-miin Hsueh,et al.  Comparison of Methods for Estimating the Number of True Null Hypotheses in Multiplicity Testing , 2003, Journal of biopharmaceutical statistics.

[17]  Chen-An Tsai,et al.  Gene selection for sample classifications in microarray experiments. , 2004, DNA and cell biology.

[18]  John D. Storey,et al.  Empirical Bayes Analysis of a Microarray Experiment , 2001 .

[19]  Chen-An Tsai,et al.  Testing for differentially expressed genes with microarray data. , 2003, Nucleic acids research.

[20]  Kathleen F. Kerr,et al.  Standardizing global gene expression analysis between laboratories and across platforms , 2005, Nature Methods.

[21]  M. Daly,et al.  PGC-1α-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes , 2003, Nature Genetics.