Estimation of Recent Ancestral Origins of Individuals on a Large Scale

The last ten years have seen exponential growth in direct-to-consumer genomics. One popular feature of these tests is an ancestry inference profile: a breakdown of the regions of the world where the test-taker's ancestors may have lived. While current methods and products generally focus on the more distant past (e.g., thousands of years ago), we have recently demonstrated that more recent ancestry can be identified by leveraging network analysis tools such as community detection. However, running community detection on a large network with potentially millions of nodes is not feasible in a live production environment where hundreds or thousands of new genotypes are processed every day. In this study, we describe a classification method that leverages network features to assign individuals to communities in a large network, where each community corresponds to a recent ancestry. We recently launched a beta version of this research as a new product feature at AncestryDNA.
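The general idea of assigning a new individual to existing communities from network features, without rerunning community detection, can be illustrated with a minimal sketch. This is not the classifier described in this study: the feature (the share of a new individual's edge weight that falls in each community), the fixed threshold rule, the function names, and the toy data are all assumptions chosen for illustration, and a trained classifier would normally replace the threshold.

```python
from collections import defaultdict


def community_features(matches, membership):
    """Return the share of a new individual's edge weight per community.

    matches: dict mapping a reference individual's id to an edge weight
             (hypothetical units, e.g. an amount of shared DNA).
    membership: dict mapping reference ids to a community label that was
                computed offline by community detection.
    """
    totals = defaultdict(float)
    for ref_id, weight in matches.items():
        label = membership.get(ref_id)
        if label is not None:  # skip matches outside the reference network
            totals[label] += weight
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {label: w / grand_total for label, w in totals.items()}


def assign_communities(features, threshold=0.3):
    """Assign every community whose share meets the (assumed) threshold."""
    return sorted(label for label, share in features.items() if share >= threshold)


# Toy reference network: four labeled individuals in three communities.
membership = {"r1": "A", "r2": "A", "r3": "B", "r4": "C"}
# Edges from one new individual; "x9" is not in the reference network.
new_matches = {"r1": 30.0, "r2": 20.0, "r3": 40.0, "r4": 10.0, "x9": 50.0}

features = community_features(new_matches, membership)
assigned = assign_communities(features)
print(features, assigned)
```

Because the features depend only on the new individual's own edges, the cost per genotype scales with that individual's number of matches rather than with the size of the whole network, which is the property that makes this style of assignment attractive in a production setting.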
