Nonlinear Dependence in the Discovery of Differentially Expressed Genes

Microarray data are used to determine which genes are active in response to a changing cell environment. Genes are “discovered” when they are significantly differentially expressed in microarray data collected under the differing conditions. In one prevalent approach, all genes are assumed to satisfy a null hypothesis, ℍ₀, of no difference in expression. A false discovery (type I error) occurs when ℍ₀ is incorrectly rejected. The quality of a detection algorithm is assessed by estimating its number of false discoveries, 𝔉. Work on second-moment modeling of the z-value histogram (which represents gene expression differentials) has shown that intergene expression correlation significantly degrades the estimate of 𝔉. This paper suggests that nonlinear dependencies could be similarly important. With an applied emphasis, it extends the “moment framework” by including third-moment skewness corrections in an estimator of 𝔉. This estimator combines observed correlation (corrected for sampling fluctuations) with information from easily identifiable null cases. Nonlinear-dependence modeling reduces the estimation error relative to that of linear estimation. Third-moment calculations involve empirical densities of 3 × 3 covariance matrices estimated using very few samples. The principle of entropy maximization is employed to connect the estimated moments to inference on 𝔉. Model results are tested with BRCA and HIV data sets and with carefully constructed simulations.
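The effect of intergene correlation on the false-discovery count can be illustrated with a small simulation. The sketch below is not the paper's estimator: it assumes a hypothetical one-factor Gaussian model with equal pairwise correlation `rho`, a two-sided rejection threshold on |z|, and all genes null. The function names and parameter values are illustrative choices only.

```python
import random
from statistics import NormalDist, mean, pstdev


def expected_false_discoveries(n_null, threshold):
    """Expected count of null z-values with |z| > threshold under N(0, 1)."""
    tail = 2.0 * (1.0 - NormalDist().cdf(threshold))
    return n_null * tail


def sample_skewness(values):
    """Standardized third central moment of a sample."""
    m = mean(values)
    s = pstdev(values)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)


def simulate_null_z(n_genes, rho, rng):
    """Assumed one-factor model: every pair of z-values has correlation rho."""
    shared = rng.gauss(0.0, 1.0)  # factor common to all genes
    a, b = rho ** 0.5, (1.0 - rho) ** 0.5
    return [a * shared + b * rng.gauss(0.0, 1.0) for _ in range(n_genes)]


rng = random.Random(0)
n_genes, threshold, rho = 2000, 2.5, 0.2  # illustrative values

expected = expected_false_discoveries(n_genes, threshold)

# Count false discoveries in repeated simulated experiments (all genes null).
counts = []
for _ in range(200):
    z = simulate_null_z(n_genes, rho, rng)
    counts.append(sum(abs(v) > threshold for v in z))

print("expected under independence:", round(expected, 2))
print("mean simulated count:", round(mean(counts), 2))
print("std of simulated count:", round(pstdev(counts), 2))
print("skewness of simulated count:", round(sample_skewness(counts), 2))
```

Under this assumed model the marginal distribution of each z-value is still N(0, 1), so the mean count stays near the independence value, while the spread of the count across experiments is much wider than independence would give. The skewness of the count is printed only as a reminder that second moments need not fully describe its distribution.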
