Identification of significant features in DNA microarray data

DNA microarrays are a relatively new technology that can simultaneously measure the expression level of thousands of genes. They have become an important tool for a wide variety of biological experiments. One of the most common goals of DNA microarray experiments is to identify genes associated with biological processes of interest. Conventional statistical tests often produce poor results when applied to microarray data owing to small sample sizes, noisy data, and correlation among the expression levels of the genes. Thus, novel statistical methods are needed to identify significant genes in DNA microarray experiments. This article discusses the challenges inherent in DNA microarray analysis and describes a series of statistical techniques that can be used to overcome these challenges. The problem of multiple hypothesis testing and its relation to microarray studies are also considered, along with several possible solutions.
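To make the multiple testing problem concrete: if 10,000 genes are each tested at a significance level of 0.05 and none is truly associated with the process of interest, about 500 genes are still expected to be called significant by chance. The sketch below compares unadjusted testing, the Bonferroni correction, and the Benjamini-Hochberg step-up procedure on a small set of made-up p-values. It is an illustration only: the abstract does not say which solutions the article covers, and the function names and p-values here are hypothetical.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject hypothesis i when p_i <= alpha / m (controls the family-wise error rate)."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure (controls the false discovery rate).

    Returns a list of booleans, True where the hypothesis is rejected.
    """
    m = len(p_values)
    # Indices of the hypotheses, ordered by ascending p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Largest 1-based rank k whose ordered p-value satisfies p_(k) <= (k / m) * alpha.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            k_max = rank
    # Reject the k_max hypotheses with the smallest p-values.
    rejected = [False] * m
    for idx in order[:k_max]:
        rejected[idx] = True
    return rejected


# Hypothetical p-values for ten genes (illustrative, not from any experiment).
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]

n_unadjusted = sum(p <= 0.05 for p in p_values)
n_bonferroni = sum(bonferroni(p_values))
n_bh = sum(benjamini_hochberg(p_values))
print(n_unadjusted, n_bonferroni, n_bh)
```

On these values the unadjusted test calls five genes significant, Bonferroni one, and Benjamini-Hochberg two, which shows the trade-off between false positives and power that motivates false discovery rate methods. Note that the basic Benjamini-Hochberg guarantee assumes independent (or positively dependent) tests, an assumption that correlated gene expression can violate.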
