Efficient haplotype matching between a query and a panel for genealogical search

Abstract Motivation With the wide availability of whole-genome genotype data, there is an increasing need for conducting genetic genealogical searches efficiently. Computationally, this task amounts to identifying shared DNA segments between a query individual and a very large panel containing millions of haplotypes. The celebrated Positional Burrows-Wheeler Transform (PBWT) data structure is a pre-computed index of the panel that enables constant time matching at each position between one haplotype and an arbitrarily large panel. However, the existing algorithm (Durbin’s Algorithm 5) can only identify set-maximal matches, the longest matches ending at any location in a panel, while in real genealogical search scenarios, multiple ‘good enough’ matches are desired. Results In this work, we developed two algorithmic extensions of Durbin’s Algorithm 5, that can find all L-long matches, matches longer than or equal to a given length L, between a query and a panel. In the first algorithm, PBWT-Query, we introduce ‘virtual insertion’ of the query into the PBWT matrix of the panel, and then scanning up and down for the PBWT match blocks with length greater than L. In our second algorithm, L-PBWT-Query, we further speed up PBWT-Query by introducing additional data structures that allow us to avoid iterating through blocks of incomplete matches. The efficiency of PBWT-Query and L-PBWT-Query is demonstrated using the simulated data and the UK Biobank data. Our results show that our proposed algorithms can detect related individuals for a given query efficiently in very large cohorts which enables a fast on-line query search. Availability and implementation genome.ucf.edu/pbwt-query Supplementary information Supplementary data are available at Bioinformatics online.

[1]  Xiang Zhou,et al.  Genome Wide Sampling Sequencing for SNP Genotyping: Methods, Challenges and Future Development , 2016, International journal of biological sciences.

[2]  Richard Durbin,et al.  Efficient haplotype matching and storage using the positional Burrows–Wheeler transform (PBWT) , 2014, Bioinform..

[3]  Gary K. Chen,et al.  Fast and flexible simulation of DNA sequence data. , 2008, Genome research.

[4]  Alexander Gusev,et al.  Whole population, genome-wide mapping of hidden relatedness. , 2009, Genome research.

[5]  Serafim Batzoglou,et al.  Parente2: a fast and accurate method for detecting identity by descent , 2015, Genome research.

[6]  B. Browning,et al.  Improving the Accuracy and Efficiency of Identity-by-Descent Detection in Population Data , 2013, Genetics.

[7]  P. Donnelly,et al.  The UK Biobank resource with deep phenotyping and genomic data , 2018, Nature.

[8]  David Mittelman,et al.  Consumer genomics will change your life, whether you get tested or not , 2018, Genome Biology.

[9]  B. Browning,et al.  A fast, powerful method for detecting identity by descent. , 2011, American journal of human genetics.

[10]  Manuel A. R. Ferreira,et al.  PLINK: a tool set for whole-genome association and population-based linkage analyses. , 2007, American journal of human genetics.

[11]  E. Thompson Identity by Descent: Variation in Meiosis, Across Genomes, and in Populations , 2013, Genetics.

[12]  P. Elliott,et al.  UK Biobank: An Open Access Resource for Identifying the Causes of a Wide Range of Complex Diseases of Middle and Old Age , 2015, PLoS medicine.

[13]  Shane A. McCarthy,et al.  Reference-based phasing using the Haplotype Reference Consortium panel , 2016, Nature Genetics.

[14]  Serafim Batzoglou,et al.  An Accurate Method for Inferring Relatedness in Large Datasets of Unphased Genotypes via an Embedded Likelihood-Ratio Test , 2013, RECOMB.

[15]  Yaniv Erlich,et al.  Identity inference of genomic data using long-range familial searches , 2018, Science.

[16]  Po-Ru Loh,et al.  Fast and accurate long-range phasing in a UK Biobank cohort , 2015, Nature Genetics.

[17]  Brian L Browning,et al.  Detecting identity by descent and estimating genotype error rates in sequence data. , 2013, American journal of human genetics.

[18]  Josyf Mychaleckyj,et al.  Robust relationship inference in genome-wide association studies , 2010, Bioinform..

[19]  S. Narum,et al.  Genotyping‐in‐Thousands by sequencing (GT‐seq): A cost effective SNP genotyping method based on custom amplicon sequencing , 2015, Molecular ecology resources.

[20]  Alkes L. Price,et al.  Fast and accurate long-range phasing in a UK Biobank cohort , 2015 .