Analysis of Population Structure: A Unifying Framework and Novel Methods Based on Sparse Factor Analysis

We consider the statistical analysis of population structure using genetic data. We show how the two most widely used approaches to modeling population structure, admixture-based models and principal components analysis (PCA), can be viewed within a single unifying framework of matrix factorization. Specifically, they can both be interpreted as approximating an observed genotype matrix by a product of two lower-rank matrices, but with different constraints or prior distributions on these lower-rank matrices. This opens the door to a large range of possible approaches to analyzing population structure, by considering other constraints or priors. In this paper, we introduce one such novel approach, based on sparse factor analysis (SFA). We investigate the effects of the different types of constraint in several real and simulated data sets. We find that SFA produces similar results to admixture-based models when the samples are descended from a few well-differentiated ancestral populations and can recapitulate the results of PCA when the population structure is more “continuous,” as in isolation-by-distance models.
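
The shared matrix-factorization view can be illustrated with a small simulation. The sketch below is not the SFA method of this paper; it only contrasts the two kinds of constraint described above on simulated genotypes. All sizes, variable names, and the simulation scheme are assumptions made for illustration, and the PCA-style factorization is computed on the uncentered genotype matrix for simplicity, whereas PCA is usually applied after centering.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, K = 60, 200, 3  # individuals, SNPs, factors (hypothetical sizes)

# Admixture-style factors (assumed simulation scheme):
# each row of Q holds one individual's ancestry proportions (non-negative, sums to 1);
# each row of P holds one ancestral population's allele frequencies in (0, 1).
Q = rng.dirichlet(alpha=np.full(K, 0.5), size=n)   # n x K
P = rng.uniform(0.05, 0.95, size=(K, p))           # K x p

# Genotypes are counts of one allele (0, 1 or 2) drawn at frequency Q @ P.
G = rng.binomial(2, Q @ P).astype(float)           # n x p

# Factorization 1: admixture-style constraints, G ~ (2Q) P.
admix_approx = 2.0 * Q @ P

# Factorization 2: PCA-style constraints, G ~ L F with orthonormal rows of F,
# obtained from the truncated singular value decomposition.
U, s, Vt = np.linalg.svd(G, full_matrices=False)
L_pca = U[:, :K] * s[:K]    # n x K loadings
F_pca = Vt[:K]              # K x p factors
pca_approx = L_pca @ F_pca


def frob_error(approx):
    """Frobenius-norm distance between G and a low-rank approximation."""
    return np.linalg.norm(G - approx)


admix_err = frob_error(admix_approx)
pca_err = frob_error(pca_approx)
print(f"admixture-style rank-{K} error: {admix_err:.2f}")
print(f"PCA-style rank-{K} error:       {pca_err:.2f}")
```

Both approximations have rank at most K; they differ only in the constraints placed on the two factor matrices. The truncated SVD is the best rank-K approximation in Frobenius norm, so its error cannot exceed that of the admixture-style product, but its factors are not constrained to be interpretable as ancestry proportions or allele frequencies.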
