A Survey and Comparative Study of Statistical Tests for Identifying Differential Expression from Microarray Data

The DNA microarray is a powerful technology that can simultaneously measure the levels of thousands of transcripts (generated, for example, from genes/miRNAs) across different experimental conditions or tissue samples. The goal of differential expression analysis is to identify the transcripts whose expression changes significantly across different types of samples or experimental conditions. A number of statistical testing methods are available for this purpose. In this paper, we provide a comprehensive survey of parametric and non-parametric testing methodologies for identifying differential expression from microarray data sets. The performance of the different testing methods is compared on real-life miRNA and mRNA expression data sets. To validate the resulting differentially expressed miRNAs, the outcomes of each test are checked against the information available for each miRNA in the standard miRNA database PhenomiR 2.0. Subsequently, we prepared simulated data sets with sample sizes ranging from 10 to 100 per group/population and calculated the power of each test individually. This comparative simulation study may help in forming robust and comprehensive judgements about the performance of each test on the basis of the assumed data distribution. Finally, a list of advantages and limitations of the different statistical tests is provided, along with indications of some areas where further study is required.
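The power calculation on simulated data described above can be illustrated with a minimal sketch. It is not the procedure used in the paper. It assumes two normally distributed groups with unit variance and uses a simple permutation test on the difference of means as a stand-in for the surveyed tests. The function names `permutation_pvalue` and `empirical_power`, the effect size, and the numbers of permutations and simulations are all hypothetical choices made for illustration.

```python
import random
import statistics


def permutation_pvalue(x, y, n_perm=200, rng=None):
    """Two-sided permutation p-value for a difference in group means."""
    rng = rng or random.Random()
    observed = abs(statistics.fmean(x) - statistics.fmean(y))
    pooled = list(x) + list(y)
    n_x = len(x)
    hits = 0
    for _ in range(n_perm):
        # Randomly reassign the pooled values to the two groups.
        rng.shuffle(pooled)
        diff = abs(statistics.fmean(pooled[:n_x]) - statistics.fmean(pooled[n_x:]))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the p-value strictly positive.
    return (hits + 1) / (n_perm + 1)


def empirical_power(n_per_group, effect, n_sim=100, alpha=0.05, seed=0):
    """Fraction of simulated data sets in which the test rejects at level alpha.

    Assumption: group 1 ~ N(0, 1) and group 2 ~ N(effect, 1).
    With effect = 0 this estimates the type I error rate instead of power.
    """
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sim):
        x = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        y = [rng.gauss(effect, 1.0) for _ in range(n_per_group)]
        if permutation_pvalue(x, y, rng=rng) <= alpha:
            rejections += 1
    return rejections / n_sim


if __name__ == "__main__":
    # Sample sizes span the 10 to 100 per group range mentioned in the abstract.
    for n in (10, 25, 50, 100):
        print(n, empirical_power(n, effect=0.5))
```

Swapping the data generator for a non-normal distribution, or the test function for another test, would give the kind of comparison across distributional assumptions that the abstract describes.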
