Adaptive trimmed t‐statistics for identifying predominantly high expression in a microarray experiment

Interesting candidate tumor markers are often not only genes that show homogeneously higher expression (HHE) in tumor samples than in control samples, but also genes with only predominantly higher expression (PHE), i.e. genes that exhibit higher expression in at least 80 per cent of tumor samples. Standard parametric test statistics used in the analysis of microarray experiments may fail under PHE because the tumor group is then a mixture of distributions. As an alternative, we consider trimmed t-statistics, which compare group mean values after removing outliers in each group. The trimming proportion can be chosen adaptively, either based on a boxplot outlier detection rule or by optimization over a series of tests with varying trimming proportions. The trimmed t-statistics can be plugged into the 'significance analysis of microarrays' (SAM) procedure, yielding the modified boxplot rule test (modBox) and the modified optimization test (modOpt), respectively. In simulated microarray experiments, we show that modOpt is superior to its contenders in detecting PHE, with only a small loss in efficiency under HHE compared to SAM. Analysis of a real microarray experiment revealed that, out of nearly 29 000 genes, about 417 genes exhibiting PHE are detected by modOpt but missed by SAM.
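
To make the idea of a trimmed t-statistic concrete, the following is a minimal sketch of the classical two-sample trimmed t for unequal variances (Yuen's statistic) for a single gene, together with a scan over several trimming proportions. It is not the modBox or modOpt procedure described above: the function names, the symmetric trimming of both tails, the grid of proportions, and the toy expression values are all assumptions made for illustration, and the SAM plug-in and permutation steps are omitted.

```python
import math
from statistics import mean, variance


def trimmed_parts(x, gamma):
    """Return trimmed mean, squared standard error term, and effective n.

    Assumption: a fraction `gamma` is trimmed from EACH tail (symmetric
    trimming), as in the classical Yuen statistic.
    """
    xs = sorted(x)
    n = len(xs)
    g = int(math.floor(gamma * n + 1e-9))  # observations cut per tail
    h = n - 2 * g                          # observations left after trimming
    if h < 2:
        raise ValueError("trimming proportion too large for this sample")
    core = xs[g:n - g]
    # Winsorized sample: trimmed values are replaced by the nearest kept value.
    winsorized = [xs[g]] * g + core + [xs[n - g - 1]] * g
    d = (n - 1) * variance(winsorized) / (h * (h - 1))
    return mean(core), d, h


def yuen_t(x, y, gamma=0.2):
    """Trimmed t-statistic and approximate degrees of freedom."""
    m1, d1, h1 = trimmed_parts(x, gamma)
    m2, d2, h2 = trimmed_parts(y, gamma)
    t = (m1 - m2) / math.sqrt(d1 + d2)
    df = (d1 + d2) ** 2 / (d1 ** 2 / (h1 - 1) + d2 ** 2 / (h2 - 1))
    return t, df


def scan_trimming(x, y, gammas=(0.0, 0.1, 0.2, 0.3)):
    """Hypothetical helper: the statistic for each trimming proportion."""
    return [(g, yuen_t(x, y, g)[0]) for g in gammas]


# Toy data (invented): 8 of 10 tumor samples are up-regulated, 2 are not.
tumor = [3.0, 3.2, 2.8, 3.1, 2.9, 3.0, 3.3, 2.7, 1.0, 0.9]
control = [1.0, 1.2, 0.8, 1.1, 0.9, 1.0, 1.3, 0.7, 1.0, 1.0]

t_plain, _ = yuen_t(tumor, control, gamma=0.0)   # reduces to Welch's t
t_trim, _ = yuen_t(tumor, control, gamma=0.2)
scan = scan_trimming(tumor, control)
```

With `gamma=0.0` the statistic reduces to the ordinary unequal-variance (Welch) t. On this toy gene the two non-responding tumor samples inflate the tumor-group variance, so the trimmed statistic is considerably larger than the untrimmed one. An adaptive procedure would choose the trimming proportion from the data and calibrate the resulting statistic, for example by permutation, rather than simply reading off the largest value from the scan.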
