Fine-Scale Estimation of Location of Birth from Genome-Wide Single-Nucleotide Polymorphism Data

Systematic nonrandom mating in populations results in genetic stratification and is predominantly caused by geographic separation, providing the opportunity to infer individuals’ birthplace from genetic data. Such inference has been demonstrated for individuals’ country of birth, but here we use data from the Northern Finland Birth Cohort 1966 (NFBC1966) to investigate the characteristics of genetic structure within a population and subsequently develop a method for inferring location at a finer scale. Principal component analysis (PCA) shows that while the first PCs are particularly informative for location, the higher-order PCs also carry location information, although a linear model cannot capture it. We introduce a new method, pcLOCATE, which exploits this information to improve the accuracy of location inference. pcLOCATE uses individuals’ PC values to estimate the probability of birth in each town and then averages over all towns, within a fully Bayesian model, to give an estimated longitude and latitude of birth. We apply pcLOCATE to the NFBC1966 data to estimate parental birthplace, testing with successively more PCs and finding the model with the top 23 PCs to be the most accurate, with a median distance of 23 km between the estimated and the true location. pcLOCATE predicts the most recent residence of NFBC1966 individuals to a median distance of 47 km. We also apply pcLOCATE to Indian individuals from the London Life Sciences Prospective Population Study (LOLIPOP) data and find that birthplace is predicted to a median distance of 54 km from the true location. A method with such accuracy is potentially valuable in population genetics and forensics.
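The two-step idea described above (a probability of birth for each town from the PC values, then an average over towns to obtain a longitude and latitude) can be illustrated with a minimal sketch. This is not the pcLOCATE model itself: the abstract describes a fully Bayesian model, whereas the sketch assumes a simple plug-in Gaussian classifier with per-town PC means, a shared per-PC variance, and priors proportional to town sample size. All function names, the `ridge` parameter, and the toy data in the usage note are hypothetical.

```python
import math

def fit_town_models(pcs, town_ids, n_towns, ridge=1e-3):
    """Fit per-town PC means, a pooled per-PC variance, and town priors.

    Assumption (not from the paper): Gaussian PC scores with a variance
    shared across towns. Every town must have at least one individual.
    """
    n_pcs = len(pcs[0])
    counts = [0] * n_towns
    sums = [[0.0] * n_pcs for _ in range(n_towns)]
    for pc, t in zip(pcs, town_ids):
        counts[t] += 1
        for k, v in enumerate(pc):
            sums[t][k] += v
    means = [[s / counts[t] for s in sums[t]] for t in range(n_towns)]
    # Pooled within-town variance per PC, plus a small ridge for stability.
    var = [ridge] * n_pcs
    for pc, t in zip(pcs, town_ids):
        for k, v in enumerate(pc):
            var[k] += (v - means[t][k]) ** 2 / len(pcs)
    prior = [c / len(pcs) for c in counts]
    return means, var, prior

def town_posterior(pc, means, var, prior):
    """Posterior probability of each town given one individual's PC values."""
    log_post = []
    for mu, p in zip(means, prior):
        # The Gaussian normalising constant is shared across towns, so it cancels.
        log_lik = -0.5 * sum((x - m) ** 2 / v for x, m, v in zip(pc, mu, var))
        log_post.append(math.log(p) + log_lik)
    top = max(log_post)  # subtract the maximum to avoid underflow
    weights = [math.exp(lp - top) for lp in log_post]
    total = sum(weights)
    return [w / total for w in weights]

def estimate_location(pc, means, var, prior, town_coords):
    """Posterior-weighted average of town (longitude, latitude) pairs."""
    post = town_posterior(pc, means, var, prior)
    lon = sum(w * c[0] for w, c in zip(post, town_coords))
    lat = sum(w * c[1] for w, c in zip(post, town_coords))
    return lon, lat

def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in km, for scoring estimated against true locations."""
    r = 6371.0088  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))
```

As a usage illustration with made-up data, one would fit on training individuals with `fit_town_models(pcs, town_ids, n_towns)`, call `estimate_location` for each test individual, and summarise accuracy as the median of `haversine_km` between estimated and true coordinates. Averaging longitude and latitude directly is only reasonable over a small region, which is an additional assumption of this sketch.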
