Efficient whole-genome association mapping using local phylogenies for unphased genotype data

MOTIVATION: Recent advances in genotyping technology have made data acquisition for whole-genome association studies cost-effective, and developing efficient methods to analyze such large-scale datasets is an active area of research. Most of the sophisticated association mapping methods currently available take phased haplotype data as input. However, phase information is not readily available from sequencing methods, and inferring the phase computationally is time-consuming, taking days to phase a single chromosome.

RESULTS: In this article, we devise an efficient method for scanning unphased whole-genome data for association. Our approach combines a recently developed linear-time algorithm for phasing genotypes on trees with a recently proposed tree-based method for association mapping. From unphased genotype data, our algorithm builds local phylogenies along the genome and scores each tree according to the clustering of cases and controls. We assess the performance of our new method on both simulated and real biological datasets.

AVAILABILITY: The software described in this article is available at http://www.daimi.au.dk/~mailund/Blossoc and distributed under the GNU General Public License.
