Empirical Bayes Analysis of Single Nucleotide Polymorphisms

Background

An important goal of whole-genome studies concerned with single nucleotide polymorphisms (SNPs) is the identification of SNPs associated with a covariate of interest, such as case-control status or the type of cancer. Since these studies often comprise the genotypes of hundreds of thousands of SNPs, methods are required that can cope with the corresponding multiple testing problem. For the analysis of gene expression data, approaches such as the empirical Bayes analysis of microarrays have been developed specifically for the detection of genes associated with the response. However, the empirical Bayes analysis of microarrays has only been suggested for binary responses when considering expression values, i.e. continuous predictors.

Results

In this paper, we propose a modification of this empirical Bayes analysis that can be used to analyze high-dimensional categorical SNP data. This approach and a generalized version of the original empirical Bayes method are available in the R package siggenes, version 1.10.0 and later, which can be downloaded from http://www.bioconductor.org.

Conclusion

As applications to two subsets of the HapMap data show, the empirical Bayes analysis of microarrays can be used not only to analyze continuous gene expression data, but also to analyze categorical SNP data, where the response is not restricted to be binary. In association studies, in which typically several tens to a few hundred SNPs are considered, our approach can furthermore be employed to test interactions of SNPs. Moreover, the posterior probabilities resulting from the empirical Bayes analysis of (prespecified) interactions/genotypes can also be used to quantify the importance of these interactions.
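To make the idea of posterior probabilities for categorical SNP data more concrete, the following is a minimal Python sketch and not the siggenes implementation. It rests on several assumptions that the abstract does not state: each SNP is coded 0/1/2 and all three genotypes are observed, the response is binary, the test statistic is Pearson's chi-square statistic (so the null distribution has 2 degrees of freedom), the density of the observed statistics is estimated by a plain equal-width histogram, and the prior probability `p0` that a SNP is not associated is supplied by the user. The function names `chi2_stat` and `posterior_probs` are hypothetical.

```python
import math


def chi2_stat(genotypes, labels):
    """Pearson chi-square statistic for a 3 x 2 table (genotype 0/1/2 by label 0/1)."""
    counts = [[0, 0] for _ in range(3)]
    for g, y in zip(genotypes, labels):
        counts[g][y] += 1
    n = len(labels)
    row = [sum(r) for r in counts]
    col = [sum(counts[g][k] for g in range(3)) for k in range(2)]
    stat = 0.0
    for g in range(3):
        for k in range(2):
            expected = row[g] * col[k] / n
            if expected > 0:  # skip empty margins instead of dividing by zero
                stat += (counts[g][k] - expected) ** 2 / expected
    return stat


def posterior_probs(stats, p0=1.0, n_bins=20):
    """Posterior probability of association, 1 - p0 * f0(z) / f(z), per statistic.

    f is estimated by an equal-width histogram of the observed statistics.
    f0 is the chi-square density with 2 degrees of freedom (assumed null),
    averaged over the same histogram bin via its CDF, 1 - exp(-z / 2).
    """
    top = max(stats)
    if top <= 0:
        return [0.0 for _ in stats]
    width = top / n_bins
    counts = [0] * n_bins
    for z in stats:
        counts[min(int(z / width), n_bins - 1)] += 1
    m = len(stats)
    post = []
    for z in stats:
        idx = min(int(z / width), n_bins - 1)
        lower, upper = idx * width, (idx + 1) * width
        f_hat = counts[idx] / (m * width)  # positive: z itself lies in this bin
        f0_bin = (math.exp(-lower / 2) - math.exp(-upper / 2)) / width
        post.append(max(0.0, 1.0 - p0 * f0_bin / f_hat))
    return post


# Toy usage: 99 unremarkable statistics and one very large one.
toy_stats = [0.5] * 99 + [30.0]
toy_post = posterior_probs(toy_stats)
```

Setting `p0 = 1.0` is a deliberately conservative choice for this sketch; in practice the prior would be estimated from the data, and a smoother density estimate than a raw histogram would likely be preferable.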
