Statistical Development and Evaluation of Microarray Gene Expression Data Filters

Filtering is a common practice used to simplify the analysis of microarray data by removing from subsequent consideration probe sets believed to be unexpressed. The m/n filter, which is widely used in the analysis of Affymetrix data, removes all probe sets having fewer than m present calls among a set of n chips. Despite its wide use, the statistical properties of the m/n filter have not been examined; here, its level and power are derived. Two alternative filters, the pooled p-value filter and the error-minimizing pooled p-value filter, are proposed. The pooled p-value filter combines information from the present-absent p-values into a single summary p-value, which is then compared to a selected significance threshold. We show that the pooled p-value filter is the uniformly most powerful statistical test under a reasonable beta model and that it exhibits greater power than the m/n filter in all scenarios considered in a simulation study. The error-minimizing pooled p-value filter compares the summary p-value with a threshold chosen to minimize a total-error criterion based on a partition of the distribution of all probes' summary p-values. The pooled p-value and error-minimizing pooled p-value filters clearly perform better than the m/n filter in a case-study analysis. The case-study analysis also demonstrates a proposed method for estimating the number of differentially expressed probe sets excluded by filtering and the subsequent impact on the final analysis. The filter impact analysis shows that the use of even the best filter may hinder, rather than enhance, the ability to discover interesting probe sets or genes. S-PLUS and R routines that implement the pooled p-value and error-minimizing pooled p-value filters have been developed and are available from www.stjuderesearch.org/depts/biostats/index.html.
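
To make the contrast between the two filters concrete, the following is a minimal sketch and not the authors' S-PLUS or R routines. It rests on several assumptions the abstract does not state: a present call is taken to mean a detection p-value below a cutoff (`call_threshold`, set to 0.04 here for illustration), the detection p-values are treated as independent and Uniform(0,1) for unexpressed probe sets, and the pooling rule is assumed to be Fisher's combination of p-values. All function names and default values are hypothetical.

```python
import math


def m_of_n_filter(p_values, m, call_threshold=0.04):
    """Keep a probe set if at least m of its n chips have a present call.

    Assumption: a chip gives a present call when its detection
    p-value is below call_threshold.
    """
    present_calls = sum(p < call_threshold for p in p_values)
    return present_calls >= m


def m_of_n_level(m, n, call_threshold=0.04):
    """Probability that an unexpressed probe set passes the m/n filter.

    Assumption: null p-values are independent Uniform(0,1), so the
    number of present calls is Binomial(n, call_threshold).
    """
    return sum(
        math.comb(n, k) * call_threshold**k * (1 - call_threshold) ** (n - k)
        for k in range(m, n + 1)
    )


def pooled_p_value(p_values):
    """Combine n p-values into one summary p-value (Fisher's method).

    Under the null, -2 * sum(log p) is chi-square with 2n degrees of
    freedom; for even degrees of freedom the upper tail has the closed
    form exp(-t) * sum_{k<n} t^k / k!, with t = -sum(log p).
    """
    n = len(p_values)
    # Floor the p-values to avoid log(0); the floor is an illustrative choice.
    t = -sum(math.log(max(p, 1e-300)) for p in p_values)
    term, total = 1.0, 1.0
    for k in range(1, n):
        term *= t / k
        total += term
    return math.exp(-t) * total


def pooled_p_value_filter(p_values, alpha=0.05):
    """Keep a probe set if its summary p-value is at most alpha."""
    return pooled_p_value(p_values) <= alpha
```

For example, a probe set with detection p-values of 0.05 on each of four chips has no present calls under the assumed cutoff, so `m_of_n_filter([0.05] * 4, m=1)` rejects it, while its pooled p-value is well below 0.05 and `pooled_p_value_filter([0.05] * 4)` retains it. This illustrates how dichotomizing each p-value into a call discards evidence that pooling preserves.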
