Leveraging Hierarchical Population Structure in Discrete Association Studies

Population structure can confound the identification of correlations in biological data. Such confounding has been recognized in multiple biological disciplines, resulting in a disparate collection of proposed solutions. We examine several methods that correct for confounding in discrete data with hierarchical population structure and identify two distinct confounding processes, which we call coevolution and conditional influence. We describe these processes in terms of generative models and show that these models can be used to correct for the confounding effects. Finally, we apply the models to three problems: identifying escape mutations in HIV-1 that arise in response to specific HLA-mediated immune pressure, predicting coevolving residues in an HIV-1 peptide, and searching for genotypes associated with bacterial resistance traits in Arabidopsis thaliana. We show that coevolution describes the confounding better in some applications and conditional influence in others; that is, no single method is best for addressing all forms of confounding. Analysis tools based on these models are available online, as both web-based applications and downloadable source code, at http://atom.research.microsoft.com/bio/phylod.aspx.
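
To make the confounding concrete, the following is a minimal sketch of how hierarchical population structure alone can inflate a naive association test. It is not the coevolution or conditional influence model described above. Everything in it is an illustrative assumption: the balanced binary tree, the per-branch flip probability, the sample sizes, and all function names. Two binary traits evolve independently down the same tree, and an ordinary 2x2 chi-square test is applied to the leaves as if they were independent samples.

```python
import random

def evolve_on_tree(depth, flip_prob, rng):
    """Evolve one binary trait down a balanced binary tree; return leaf states."""
    states = [rng.random() < 0.5]  # random root state
    for _ in range(depth):
        children = []
        for s in states:
            for _ in range(2):  # each node has two children
                # A child inherits the parent state unless a flip occurs.
                children.append((not s) if rng.random() < flip_prob else s)
        states = children
    return states

def chi_square_2x2(x, y):
    """Pearson chi-square statistic for two binary vectors (0.0 if degenerate)."""
    n = len(x)
    a = sum(1 for xi, yi in zip(x, y) if xi and yi)
    b = sum(1 for xi, yi in zip(x, y) if xi and not yi)
    c = sum(1 for xi, yi in zip(x, y) if not xi and yi)
    d = n - a - b - c
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:  # one trait is constant, so the test is undefined
        return 0.0
    return n * (a * d - b * c) ** 2 / denom

def tree_pair(rng, depth=6, flip_prob=0.05):
    """Two traits evolved independently on the same tree (2**depth leaves)."""
    return (evolve_on_tree(depth, flip_prob, rng),
            evolve_on_tree(depth, flip_prob, rng))

def iid_pair(rng, n=64):
    """Two independent traits sampled with no population structure."""
    return ([rng.random() < 0.5 for _ in range(n)],
            [rng.random() < 0.5 for _ in range(n)])

def false_positive_rate(make_pair, trials, rng, threshold=3.841):
    """Fraction of null trials whose statistic exceeds the 5% critical value (1 df)."""
    hits = 0
    for _ in range(trials):
        x, y = make_pair(rng)
        if chi_square_2x2(x, y) > threshold:
            hits += 1
    return hits / trials

rng = random.Random(0)
fpr_tree = false_positive_rate(tree_pair, 500, rng)
fpr_iid = false_positive_rate(iid_pair, 500, rng)
print("false positive rate, tree-structured leaves:", fpr_tree)
print("false positive rate, unstructured samples:  ", fpr_iid)
```

In this toy setting neither trait influences the other, so every rejection is a false positive. Because leaves that share recent ancestry tend to share trait values, the structured case rejects far more often than the nominal 5% level, which is the kind of effect a tree-aware null model is meant to absorb.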
