Robust differential expression analysis by learning discriminant boundary in multi-dimensional space of statistical attributes

BackgroundPerforming statistical tests is an important step in analyzing genome-wide datasets for detecting genomic features differentially expressed between conditions. Each type of statistical test has its own advantages in characterizing certain aspects of differences between population means and often assumes a relatively simple data distribution (e.g., Gaussian, Poisson, negative binomial, etc.), which may not be well met by the datasets of interest. Making insufficient distributional assumptions can lead to inferior results when dealing with complex differential expression patterns.ResultsWe propose to capture differential expression information more comprehensively by integrating multiple test statistics, each of which has relatively limited capacity to summarize the observed differential expression information. This work addresses a general application scenario, in which users want to detect as many as DEFs while requiring the false discovery rate (FDR) to be lower than a cut-off. We treat each test statistic as a basic attribute, and model the detection of differentially expressed genomic features as learning a discriminant boundary in a multi-dimensional space of basic attributes. We mathematically formulated our goal as a constrained optimization problem aiming to maximize discoveries satisfying a user-defined FDR. An effective algorithm, Discriminant-Cut, has been developed to solve an instantiation of this problem. Extensive comparisons of Discriminant-Cut with 13 existing methods were carried out to demonstrate its robustness and effectiveness.ConclusionsWe have developed a novel machine learning methodology for robust differential expression analysis, which can be a new avenue to significantly advance research on large-scale differential expression analysis.

[1]  Y. Benjamini,et al.  Controlling the false discovery rate: a practical and powerful approach to multiple testing , 1995 .

[2]  Mario Medvedovic,et al.  Intensity-based hierarchical Bayes method improves testing for differentially expressed genes in microarray experiments , 2006, BMC Bioinformatics.

[3]  Ingrid Lönnstedt Replicated microarray data , 2001 .

[4]  F. Wilcoxon Individual Comparisons by Ranking Methods , 1945 .

[5]  John D. Storey,et al.  Empirical Bayes Analysis of a Microarray Experiment , 2001 .

[6]  John D. Storey A direct approach to false discovery rates , 2002 .

[7]  Robert Tibshirani,et al.  Finding consistent patterns: A nonparametric approach for identifying differential expression in RNA-Seq data , 2013, Statistical methods in medical research.

[8]  Ronald W. Davis,et al.  Quantitative Monitoring of Gene Expression Patterns with a Complementary DNA Microarray , 1995, Science.

[9]  M. Gerstein,et al.  RNA-Seq: a revolutionary tool for transcriptomics , 2009, Nature Reviews Genetics.

[10]  W. Huber,et al.  Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2 , 2014, Genome Biology.

[11]  John D. Storey The optimal discovery procedure: a new approach to simultaneous significance testing , 2007 .

[12]  Adina I. Tarcolea,et al.  Combining Significance of Correlated Statistics with Application to Panel Data , 2006 .

[13]  Jeff H. Chang,et al.  The NBP Negative Binomial Model for Assessing Differential Gene Expression from RNA-Seq , 2011 .

[14]  M. Robinson,et al.  Small-sample estimation of negative binomial dispersion, with applications to SAGE data. , 2007, Biostatistics.

[15]  A D Long,et al.  Improved Statistical Inference from DNA Microarray Data Using Analysis of Variance and A Bayesian Statistical Framework , 2001, The Journal of Biological Chemistry.

[16]  Charity W. Law,et al.  voom: precision weights unlock linear model analysis tools for RNA-seq read counts , 2014, Genome Biology.

[17]  Bradley Efron,et al.  Microarrays empirical Bayes methods, and false discovery rates , 2001 .

[18]  Robert Tibshirani,et al.  SAM “Significance Analysis of Microarrays” Users guide and technical document , 2002 .

[19]  Thomas J. Hardcastle,et al.  baySeq: Empirical Bayesian methods for identifying differential expression in sequence count data , 2010, BMC Bioinformatics.

[20]  L. AuerPaul,et al.  A Two-Stage Poisson Model for Testing RNA-Seq Data , 2011 .

[21]  R. W. Wedderburn Quasi-likelihood functions, generalized linear models, and the Gauss-Newton method , 1974 .

[22]  A. Conesa,et al.  Differential expression in RNA-seq: a matter of depth. , 2011, Genome research.

[23]  X. Cui,et al.  Improved statistical tests for differential gene expression by shrinking variance components estimates. , 2005, Biostatistics.

[24]  S. Dudoit,et al.  Normalization of RNA-seq data using factor analysis of control genes or samples , 2014, Nature Biotechnology.

[25]  John W. V. Storey The False Discovery Rate: A Bayesian Interpre-tation and the q-value , 2001 .

[26]  M. Stephens,et al.  RNA-seq: an assessment of technical reproducibility and comparison with gene expression arrays. , 2008, Genome research.

[27]  R. Tibshirani,et al.  Significance analysis of microarrays applied to the ionizing radiation response , 2001, Proceedings of the National Academy of Sciences of the United States of America.

[28]  David P. Kreil,et al.  A comprehensive assessment of RNA-seq accuracy, reproducibility and information content by the Sequencing Quality Control consortium , 2014, Nature Biotechnology.

[29]  Student,et al.  THE PROBABLE ERROR OF A MEAN , 1908 .

[30]  Charlotte Soneson,et al.  A comparison of methods for differential expression analysis of RNA-seq data , 2013, BMC Bioinformatics.

[31]  Richard J. Fox,et al.  A two-sample Bayesian t-test for microarray data , 2006, BMC Bioinformatics.

[32]  George C. Tseng,et al.  Meta-analysis methods for combining multiple expression profiles: comparisons, statistical characterization and an application guideline , 2013, BMC Bioinformatics.

[33]  Pritam Chanda,et al.  Statistical Applications in Genetics and Molecular Biology Information Metrics in Genetic Epidemiology , 2011 .

[34]  Catalin C. Barbacioru,et al.  Evaluation of DNA microarray results with quantitative gene expression platforms , 2006, Nature Biotechnology.

[35]  P. Levitt,et al.  Comparative DNA methylation among females with neurodevelopmental disorders and seizures identifies TAC1 as a MeCP2 target gene , 2013, Journal of Neurodevelopmental Disorders.

[36]  A. W. van der Vaart,et al.  Bayesian analysis of RNA sequencing data by estimating multiple shrinkage priors. , 2013, Biostatistics.

[37]  Gordon K Smyth,et al.  Statistical Applications in Genetics and Molecular Biology Linear Models and Empirical Bayes Methods for Assessing Differential Expression in Microarray Experiments , 2011 .

[38]  R. Tibshirani,et al.  Normalization, testing, and false discovery rate estimation for RNA-sequencing data. , 2012, Biostatistics.

[39]  Ning Leng,et al.  EBSeq: an empirical Bayes hierarchical model for inference in RNA-seq experiments , 2013, Bioinform..

[40]  Xin Xu,et al.  Combining dependent tests for linkage or association across multiple phenotypic traits. , 2003, Biostatistics.

[41]  S. Fernandez,et al.  Fully Moderated T-statistic for Small Sample Size Gene Expression Arrays , 2011, Statistical applications in genetics and molecular biology.

[42]  Jia Li,et al.  An adaptively weighted statistic for detecting differential gene expression when combining multiple transcriptomic studies , 2011, 1108.3180.

[43]  W. Huber,et al.  which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. MAnorm: a robust model for quantitative comparison of ChIP-Seq data sets , 2011 .

[44]  Xuegong Zhang,et al.  DEGseq: an R package for identifying differentially expressed genes from RNA-seq data , 2010, Bioinform..

[45]  Mark D. Robinson,et al.  edgeR: a Bioconductor package for differential expression analysis of digital gene expression data , 2009, Bioinform..

[46]  Mark D. Robinson,et al.  Moderated statistical tests for assessing differences in tag abundance , 2007, Bioinform..

[47]  R. Guigó,et al.  Transcriptome genetics using second generation sequencing in a Caucasian population , 2010, Nature.

[48]  S. P. Fodor,et al.  High density synthetic oligonucleotide arrays , 1999, Nature Genetics.

[49]  Ning Leng,et al.  EBSeq: an empirical Bayes hierarchical model for inference in RNA-seq experiments , 2013, Bioinform..