Validation and characterization of DNA microarray gene expression data distribution and associated moments

Background: Data from DNA microarrays are increasingly used to understand how different conditions, exposures, or diseases modulate the expression of genes in a biological system. This knowledge is then used to generate molecular mechanistic hypotheses for an organism exposed to those conditions. Several methods have been proposed to analyze these data under different distributional assumptions on gene expression. However, empirical validation of these assumptions is lacking.

Results: Best-fit hypothesis tests, moment-ratio diagrams, and relationships between the moments of the gene expression distribution were used to characterize the observed distributions. The data were obtained from the publicly available gene expression database Gene Expression Omnibus (GEO) and cover varying experimental situations, each of which provides a relatively large number of samples for hypothesis testing. All data came from one of two microarray platforms: the commercial Affymetrix mouse 430.2 platform and a non-commercial Rosetta/Merck platform. The data from each platform were preprocessed in the same manner.

Conclusions: The null hypotheses for goodness of fit for all considered univariate theoretical probability distributions (including the Normal distribution) are rejected for more than 50% of probe sets on the Affymetrix microarray platform at a 95% confidence level. This suggests that, under the tested conditions, an a priori assumption of any of these distributions across all probe sets is not valid. The pattern of null hypothesis rejection was different for the data from the Rosetta/Merck platform, with only around 20% of the probe sets failing the logistic distribution goodness-of-fit test.
We find statistically significant relationships (at a 95% confidence level, based on the F-test for the fitted linear model) between the mean and the logarithm of the coefficient of variation of the distributions of the logarithm of gene expression. An additional, novel, statistically significant quadratic relationship between the skewness and kurtosis is identified. In an analysis of the L-moment ratio diagram, data from both microarray platforms fail to match any one of the chosen theoretical probability distributions.
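As an illustration of the quantities plotted on an L-moment ratio diagram, the sketch below computes the sample L-skewness and L-kurtosis for one probe set. It is not the authors' code: the abstract does not say which estimator was used, so the sketch assumes the standard unbiased estimator based on probability-weighted moments, and the function name `sample_l_moment_ratios` is hypothetical. The input is assumed to be the (already log-transformed) expression values of a single probe set across samples.

```python
def sample_l_moment_ratios(values):
    """Return (l1, l2, t3, t4) for a sample of expression values.

    l1 and l2 are the first two sample L-moments (location and scale);
    t3 = l3 / l2 is the L-skewness and t4 = l4 / l2 is the L-kurtosis.
    Assumption: unbiased probability-weighted-moment estimator.
    """
    x = sorted(float(v) for v in values)
    n = len(x)
    if n < 4:
        raise ValueError("at least 4 observations are needed for t4")

    # Probability-weighted moments b_0 .. b_3.
    # For the observation with 0-based rank i, the weight for b_r is
    # i (i-1) ... (i-r+1) / ((n-1) (n-2) ... (n-r)).
    b = [0.0, 0.0, 0.0, 0.0]
    for i, xi in enumerate(x):
        weight = 1.0
        for r in range(4):
            if r > 0:
                weight *= (i - (r - 1)) / (n - r)
            b[r] += weight * xi
    b = [total / n for total in b]

    # Sample L-moments as linear combinations of the b_r.
    l1 = b[0]
    l2 = 2 * b[1] - b[0]
    l3 = 6 * b[2] - 6 * b[1] + b[0]
    l4 = 20 * b[3] - 30 * b[2] + 12 * b[1] - b[0]
    if l2 == 0:
        raise ValueError("L-scale is zero; the ratios are undefined")
    return l1, l2, l3 / l2, l4 / l2


# Hypothetical values for one probe set, for illustration only.
l1, l2, t3, t4 = sample_l_moment_ratios([1, 2, 3, 4, 50])
print(f"L-skewness = {t3:.3f}, L-kurtosis = {t4:.3f}")
```

Each probe set then contributes one (t3, t4) point to the diagram, which can be compared with the points or curves traced by the candidate theoretical distributions.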
