An assessment of recently published gene expression data analyses: reporting experimental design and statistical factors

Background
The analysis of large-scale gene expression data is a fundamental approach to functional genomics and the identification of potential drug targets. Results derived from such studies cannot be trusted unless the studies are adequately designed and reported. The purpose of this study is to assess current practices in the reporting of experimental design and statistical analyses in gene expression-based studies.

Methods
We reviewed hundreds of MEDLINE-indexed papers involving gene expression data analysis, published between 2003 and 2005. These papers were examined on the basis of their reporting of several factors, such as sample size, statistical power and software availability.

Results
Among the examined papers, we concentrated on 293 papers consisting of applications and new methodologies. These papers did not report approaches to sample size and statistical power estimation. Explicit statements on data transformation and descriptions of the normalisation techniques applied prior to data analyses (e.g. classification) were not reported in 57 (37.5%) and 104 (68.4%) of the methodology papers, respectively. Among the papers presenting biomedically relevant applications, 41 (29.1%) did not report on data normalisation and 83 (58.9%) did not describe the normalisation technique applied. Clustering-based analysis, the t-test and ANOVA are the most widely applied techniques in microarray data analysis. Remarkably, however, only 5 (3.5%) of the application papers included statements on, or references to, assumptions about variance homogeneity for the application of the t-test and ANOVA. There is still a need to promote the reporting of the software packages applied and their availability.

Conclusion
Recently published gene expression data analysis studies may lack key information required for properly assessing their design quality and potential impact. There is a need for more rigorous reporting of important experimental factors such as statistical power and sample size, as well as the correct description and justification of the statistical methods applied. This paper highlights the importance of defining a minimum set of information required for reporting on the statistical design and analysis of expression data. By improving practices of statistical analysis reporting, the scientific community can facilitate quality assurance and peer-review processes, as well as the reproducibility of results.
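
As an illustration of the kind of sample size statement the conclusion calls for, the following minimal sketch estimates the number of samples per group for a two-sided, two-group comparison of a single gene. It is not a method taken from the reviewed papers: it assumes a normal approximation, equal group sizes, a common standard deviation, and a Bonferroni adjustment for the number of genes tested. The function name samples_per_group and all parameter values are hypothetical.

```python
from math import ceil
from statistics import NormalDist


def samples_per_group(effect, sd, alpha=0.05, power=0.80, n_tests=1):
    """Approximate samples per group for a two-sided, two-group comparison.

    Assumptions (illustrative only): normal approximation, equal group
    sizes, common standard deviation `sd`, and a Bonferroni-adjusted
    significance level alpha / n_tests.
    Formula: n = 2 * ((z_alpha + z_beta) * sd / effect) ** 2
    """
    if effect <= 0 or sd <= 0:
        raise ValueError("effect and sd must be positive")
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    std_normal = NormalDist()
    # Critical value for the two-sided test at the adjusted level.
    z_alpha = std_normal.inv_cdf(1 - (alpha / n_tests) / 2)
    # Quantile corresponding to the desired power (1 - type II error rate).
    z_beta = std_normal.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_beta) * sd / effect) ** 2)


if __name__ == "__main__":
    # Hypothetical inputs: a difference of 1 unit on the log2 scale,
    # a standard deviation of 1, and either 1 or 10,000 genes tested.
    for genes in (1, 10_000):
        n = samples_per_group(effect=1.0, sd=1.0, n_tests=genes)
        print(f"genes tested: {genes:>6}  samples per group: {n}")
```

Reporting the inputs to such a calculation (effect size, variability estimate, significance level, target power and the multiple-testing adjustment) alongside the result lets a reader judge whether a study was adequately sized. Bonferroni is used here only for simplicity; a false discovery rate-based calculation would give different numbers.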
