Finding All Maximal Perfect Haplotype Blocks in Linear Time

Recent large-scale community sequencing efforts allow the identification, at an unprecedented level of detail, of genomic regions that show signatures of natural selection. Traditional methods for identifying such regions from individuals' haplotype data, however, require excessive computing time and are therefore not applicable to current datasets. In 2018, Cunha et al. (Proceedings of BSB 2018) suggested the maximal perfect haplotype block as a very simple combinatorial pattern that forms the basis of a new method for rapid genome-wide selection scans. The algorithm they presented for identifying these blocks, however, had a worst-case running time quadratic in the genome length. It was posed as an open problem whether an optimal, linear-time algorithm exists. In this paper we give two algorithms that achieve this time bound: a conceptually very simple one using suffix trees, and a second one using the positional Burrows-Wheeler Transform that is also very efficient in practice.
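To make the pattern concrete, the following is a minimal brute-force sketch of what a maximal perfect haplotype block is. It is not either of the linear-time algorithms announced above. It assumes the usual definition, which the abstract itself does not spell out: given k equal-length haplotype strings, a block is a set of at least two rows together with a column interval [i, j] on which all those rows agree, such that the interval cannot be extended left or right and no further row can be added. The function name `maximal_perfect_blocks` and the 0-based, inclusive column indices are illustrative choices, not taken from the paper.

```python
from collections import defaultdict


def maximal_perfect_blocks(haplotypes):
    """Enumerate maximal perfect haplotype blocks by brute force.

    haplotypes: list of equal-length strings (one per individual).
    Returns a set of (rows, i, j) with rows a tuple of row indices and
    i, j 0-based inclusive column bounds. Illustrative only: this tries
    every column interval and is far slower than linear time.
    """
    k = len(haplotypes)
    n = len(haplotypes[0]) if k else 0
    blocks = set()
    for i in range(n):
        for j in range(i, n):
            # Grouping ALL rows by their substring on [i, j] makes each
            # group row-maximal by construction.
            groups = defaultdict(list)
            for r, h in enumerate(haplotypes):
                groups[h[i:j + 1]].append(r)
            for rows in groups.values():
                if len(rows) < 2:
                    continue
                # Left-maximal: at the first column, or rows disagree at i-1.
                left_max = i == 0 or len({haplotypes[r][i - 1] for r in rows}) > 1
                # Right-maximal: at the last column, or rows disagree at j+1.
                right_max = j == n - 1 or len({haplotypes[r][j + 1] for r in rows}) > 1
                if left_max and right_max:
                    blocks.add((tuple(rows), i, j))
    return blocks


example = ["010", "011", "111"]
print(sorted(maximal_perfect_blocks(example)))
```

For the three toy haplotypes above, the sketch reports three blocks: rows 0 and 1 on columns 0 to 1, all three rows on column 1 alone, and rows 1 and 2 on columns 1 to 2. A reference implementation of this kind is mainly useful for checking a faster algorithm on small inputs.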

[1] Luis Antonio Brasil Kowada, et al. Identifying Maximal Perfect Haplotype Blocks, 2018, BSB.

[2] Gabor T. Marth, et al. An integrated map of structural variation in 2,504 human genomes, 2015, Nature.

[3] Gerton Lunter, et al. Haplotype matching in large cohorts using the Li and Stephens model, 2018, Bioinform.

[4] Jody Hey, et al. A Hidden Markov Model for Investigating Recent Positive Selection through Haplotype Structure, 2014, bioRxiv.

[5] M. Kimura. The number of heterozygous nucleotide sites maintained in a finite population due to steady flux of mutations, 1969, Genetics.

[6] Enno Ohlebusch, et al. Replacing suffix trees with enhanced suffix arrays, 2004, J. Discrete Algorithms.

[7] Martin Farach-Colton, et al. Optimal Suffix Tree Construction with Large Alphabets, 1997, FOCS.

[8] Yvonne Feierabend, et al. Population Genetics: A Concise Guide, 2016.

[9] Damian Smedley, et al. The 100 000 Genomes Project: bringing whole genome sequencing to the NHS, 2018, British Medical Journal.

[10] Gabor T. Marth, et al. A global reference for human genetic variation, 2015, Nature.

[11] Richard Durbin, et al. Efficient haplotype matching and storage using the positional Burrows–Wheeler transform (PBWT), 2014, Bioinform.

[12] Bjarni V. Halldórsson, et al. Large-scale whole-genome sequencing of the Icelandic population, 2015, Nature Genetics.

[13] Alexander Schönhuth, et al. A high-quality human reference panel reveals the complexity and distribution of genomic structural variants, 2016, Nature Communications.

[14] Hilde van der Togt, et al. Publisher's Note, 2003, J. Netw. Comput. Appl.

[15] Jens Stoye, et al. Finding all maximal perfect haplotype blocks in linear time, 2019, Algorithms for Molecular Biology.

[16] Helen E. Parkinson, et al. The NHGRI-EBI GWAS Catalog of published genome-wide association studies, targeted arrays and summary statistics 2019, 2018, Nucleic Acids Res.

[17] Dan Gusfield, et al. Algorithms on Strings, Trees, and Sequences - Computer Science and Computational Biology, 1997.