Searching Genome-wide Disease Association Through SNP Data

Taking advantage of high-throughput Single Nucleotide Polymorphism (SNP) genotyping technology, Genome-Wide Association Studies (GWASs) are regarded as holding promise for unravelling complex relationships between genotype and phenotype. GWASs aim to identify genetic variants associated with disease by assaying and analyzing hundreds of thousands of SNPs. Traditional single-locus and two-locus methods have been standardized and have led to many interesting findings. Recently, a substantial number of GWASs have indicated that, for most disorders, joint genetic effects (epistatic interactions) across the whole genome are widespread in complex traits. At present, identifying high-order epistatic interactions from GWASs is computationally and methodologically challenging. My dissertation research focuses on the problem of searching for genome-wide associations under three frequently encountered scenarios: one case and one control, multiple cases and multiple controls, and Linkage Disequilibrium (LD) block structure. For the first scenario, we present a simple and fast method, named DCHE, that uses dynamic clustering. For the second, we design two methods, a Bayesian inference based method and a heuristic method, to detect genome-wide multi-locus epistatic interactions on multiple diseases. For the last scenario, we propose a block-based Bayesian approach to model LD and conditional disease association simultaneously. Experimental results on both synthetic and real GWAS datasets show that the proposed methods improve the detection accuracy of disease-specific associations and reduce the computational cost compared with current popular methods.

INDEX WORDS: Algorithm, GWAS, SNP analysis, epistatic interactions, epistasis, clustering, Bayesian theory, Markov Chain Monte Carlo
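To make the computational challenge concrete, the following is a minimal sketch of a generic two-locus association test: a Pearson chi-square statistic computed on the joint genotype of two SNPs against case/control status. It is not DCHE or any other method from this dissertation. The function names, the 0/1/2 genotype coding, and the toy data are assumptions made purely for illustration.

```python
from collections import Counter
from math import comb


def chi_square_two_locus(genotypes, labels, i, j):
    """Pearson chi-square statistic for the joint genotype at SNPs i and j
    versus case/control status.

    genotypes: list of per-sample lists, entries coded 0, 1 or 2 (assumed coding).
    labels:    list of 0 (control) or 1 (case), one entry per sample.
    """
    # Observed counts for each (genotype at i, genotype at j, label) cell.
    observed = Counter()
    for row, y in zip(genotypes, labels):
        observed[(row[i], row[j], y)] += 1

    # Marginal totals for genotype combinations and for labels.
    n = len(labels)
    combo_totals = Counter()
    label_totals = Counter()
    for (a, b, y), count in observed.items():
        combo_totals[(a, b)] += count
        label_totals[y] += count

    # Sum (observed - expected)^2 / expected over all cells of the table.
    stat = 0.0
    for (a, b), combo_n in combo_totals.items():
        for y, label_n in label_totals.items():
            expected = combo_n * label_n / n
            stat += (observed[(a, b, y)] - expected) ** 2 / expected
    return stat


def num_tests(num_snps, order):
    """Number of SNP combinations an exhaustive search of this order must score."""
    return comb(num_snps, order)


# Toy data (hypothetical): a sample is a case only when both SNPs are non-zero.
toy_genotypes = [[0, 0], [0, 1], [1, 0], [1, 1]] * 2
toy_labels = [0, 0, 0, 1] * 2
toy_stat = chi_square_two_locus(toy_genotypes, toy_labels, 0, 1)
```

An exhaustive search has to compute such a statistic once per SNP combination. Calling `num_tests(500_000, 2)` and `num_tests(500_000, 3)` for an illustrative panel of 500,000 SNPs shows how quickly that count grows with the interaction order, which is why high-order searches typically rely on filtering, clustering, or sampling rather than enumeration.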
