Variational Inference of Population Structure in Large SNP Datasets

Tools for estimating population structure from genetic data are now used in a wide variety of applications in population genetics. However, inferring population structure in large modern datasets poses severe computational challenges. Here, we develop efficient algorithms for approximate inference of the model underlying the STRUCTURE program using a variational Bayesian framework. Variational methods recast the computation of the relevant posterior distributions as an optimization problem, allowing us to build on recent advances in optimization theory to develop fast inference tools. In addition, we propose useful heuristic scores to identify the number of populations represented in a dataset and a new hierarchical prior to detect weak population structure in the data. We test the variational algorithms on simulated data and illustrate them using genotype data from the CEPH-Human Genome Diversity Panel. The variational algorithms are almost two orders of magnitude faster than STRUCTURE and achieve accuracies comparable to those of ADMIXTURE. Furthermore, our results show that the heuristic scores for choosing model complexity provide a reasonable range of values for the number of populations represented in the data, with minimal bias towards detecting structure when it is very weak. Our algorithm, fastSTRUCTURE, is freely available online at http://pritchardlab.stanford.edu/structure.html.
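The idea of treating posterior computation as optimization can be illustrated with a minimal sketch. It is not the fastSTRUCTURE algorithm: the two-state latent variable, the probabilities in `joint`, and the grid search are all hypothetical choices made for illustration. The sketch shows only that the evidence lower bound (ELBO) never exceeds the log evidence and is maximized when the approximating distribution equals the exact posterior.

```python
import math

def elbo(q1, joint):
    """ELBO for a two-state latent variable z, where q(z=1) = q1.

    joint[z] holds p(z) * p(x | z) for one fixed observation x.
    """
    q = (1.0 - q1, q1)
    return sum(qz * (math.log(pz) - math.log(qz))
               for qz, pz in zip(q, joint) if qz > 0.0)

# Hypothetical toy model (assumed numbers, not from the paper):
# uniform prior over z, likelihoods p(x | z=0) = 0.2 and p(x | z=1) = 0.6.
joint = (0.5 * 0.2, 0.5 * 0.6)
log_evidence = math.log(sum(joint))
exact_posterior_z1 = joint[1] / sum(joint)

# Crude optimizer: grid search over q(z=1). Practical variational
# algorithms use far more efficient update schemes than this.
grid = [i / 1000 for i in range(1, 1000)]
best_q1 = max(grid, key=lambda q1: elbo(q1, joint))

print("exact posterior p(z=1 | x):", exact_posterior_z1)
print("ELBO-maximizing q(z=1):   ", best_q1)
print("log evidence:", log_evidence, " best ELBO:", elbo(best_q1, joint))
```

In this toy case the optimum can be checked against the exact posterior because the model is tiny. In a large admixture model the exact posterior is intractable, which is why the optimization view is useful.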
