Identifying disease-associated SNP clusters via contiguous outlier detection

MOTIVATION: Although genome-wide association studies (GWAS) have identified many disease-susceptibility single-nucleotide polymorphisms (SNPs), these findings explain only a small portion of the genetic contribution to complex diseases, a gap known as the missing heritability. A possible explanation is that genetic variants with small effects have not been detected. The chance that a causal SNP will be directly genotyped is < 8%. The effects of its neighboring SNPs may be too weak to detect because of the effect decay caused by imperfect linkage disequilibrium. Moreover, a causal SNP with a small effect remains hard to detect even when it has been directly genotyped.

RESULTS: To increase statistical power when detecting disease-associated SNPs with relatively small effects, we propose a method that uses neighborhood information. Since disease-associated SNPs account for only a small fraction of the entire SNP set, we formulate the problem as Contiguous Outlier DEtection (CODE), a discrete optimization problem. In this formulation, we cast the disease-associated SNPs as outliers and impose a spatial continuity constraint on outlier detection. We show that this optimization can be solved exactly using graph cuts. We also employ the stability selection strategy to control false positives caused by imperfect parameter tuning. We demonstrate the advantage of the method in simulations and real experiments. In particular, the newly identified SNP clusters are replicable in two independent datasets.

AVAILABILITY: The software is available at: http://bioinformatics.ust.hk/CODE.zip.

CONTACT: eeyu@ust.hk

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
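As an illustration of the contiguity idea described under RESULTS, the following is a minimal sketch and not the CODE software itself. It assumes a hypothetical energy over binary labels s_i along a single chain of SNPs, E(s) = sum_i s_i * (beta - score_i) + lam * sum_i |s_i - s_{i+1}|, where `scores`, `beta` and `lam` are names introduced here for illustration. The abstract states that the actual formulation is solved with graph cuts; on a one-dimensional chain the same kind of energy can also be minimized exactly by dynamic programming, which is what this sketch uses instead.

```python
from typing import List, Sequence


def contiguous_outliers(scores: Sequence[float], beta: float, lam: float) -> List[int]:
    """Label each position 0 (null) or 1 (outlier) on a chain.

    Minimizes the assumed energy
        sum_i s_i * (beta - scores[i]) + lam * sum_i |s_i - s_{i+1}|
    exactly with a two-state dynamic program (Viterbi-style).
    """
    n = len(scores)
    if n == 0:
        return []
    # cost[k] = lowest energy of the prefix whose last label is k
    cost = [0.0, beta - scores[0]]
    back = []  # back[i - 1][k] = best label at position i - 1 given label k at i
    for i in range(1, n):
        unary = [0.0, beta - scores[i]]
        new_cost, choices = [], []
        for k in (0, 1):
            stay = cost[k]                # previous label equals k
            switch = cost[1 - k] + lam    # label change pays the penalty
            if stay <= switch:
                new_cost.append(stay + unary[k])
                choices.append(k)
            else:
                new_cost.append(switch + unary[k])
                choices.append(1 - k)
        cost = new_cost
        back.append(choices)
    # Trace the optimal labeling back from the last position.
    label = 0 if cost[0] <= cost[1] else 1
    labels = [label]
    for choices in reversed(back):
        label = choices[label]
        labels.append(label)
    labels.reverse()
    return labels


# Made-up association scores: a weak SNP (1.5) sits between two strong ones.
scores = [0.1, 0.2, 3.0, 1.5, 3.2, 0.3, 0.1]
print(contiguous_outliers(scores, beta=2.0, lam=0.0))  # [0, 0, 1, 0, 1, 0, 0]
print(contiguous_outliers(scores, beta=2.0, lam=0.6))  # [0, 0, 1, 1, 1, 0, 0]
```

With `lam=0.0` the sketch reduces to plain thresholding and the weak middle SNP is missed. With a positive `lam` it is labeled together with its neighbors, which is the effect the continuity constraint is meant to produce.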
