Detecting differentially expressed genes while controlling the false discovery rate for microarray data

Microarray technology enables researchers to investigate the expression levels of thousands of genes simultaneously. One common goal of microarray data analysis is to detect differentially expressed genes while controlling the false discovery rate. This dissertation consists of four papers written to address this goal and is organized as follows. Chapter 1 gives a brief introduction to the Affymetrix GeneChip microarray technology and introduces the concept of differentially expressed genes and the definition of the false discovery rate. Chapter 2 reviews the related literature. Chapter 3 proposes a t-mixture model based method to detect differentially expressed genes. Chapter 4 proposes a t-mixture model based false discovery rate estimator to overcome several problems of the current empirical false discovery rate estimators. Chapter 5 proposes a two-step false discovery rate estimation procedure to correct the overestimation of the false discovery rate caused by differentially expressed genes. Chapter 6 develops a novel estimator of the proportion of equivalently expressed genes, which is an important component of false discovery rate estimators. Chapter 7 summarizes the dissertation and outlines some possible directions for future work.

Acknowledgements

The completion of this dissertation would have been impossible without the support of many people. I would like to express my deepest gratitude to my advisor, Dr. Shunpu Zhang. He directed me into the area of my dissertation, gave me insightful advice, and encouragingly supported my ideas. I would also like to thank my co-advisor, Dr. Stephen D. Kachman, for always being there to listen and discuss. I learned a lot from his way of thinking. I would like to thank Dr. Kent M. Eskridge and Dr. Istvan Ladunga for serving on my Ph.D. supervisory committee. Their careful proofreading of the dissertation proposal helped me improve my writing skills, and I am grateful to them for holding me to a high research standard. [...] providing the financial support to me, which was crucial for my Ph.D. program. I want to give special thanks to Dr. Yuannan Xia for letting me participate in his microarray experiment. I dedicate this work to my parents, my fiancee, and our family, who have been supportive all the time.
