Fast and Scalable Genome-Wide Inference of Local Tree Topologies from Large Number of Haplotypes Based on Tree Consistent 𝒫ℬ𝒲𝒯 Data Structure

Estimating the relationships between DNA sequences is one of the most important problems in genomics. Understanding these relationships is central to demographic inference, correction for population structure in GWAS, and identification of signals of selection, among other applications. The data structure containing the full information about sample genealogy is called the ancestral recombination graph (ARG). However, ARG inference is a very difficult problem, not least because of its very complex state space. In this work we describe a new approach for fast and scalable generation of local tree topologies relating large numbers of haplotypes. Our method is closely related to ARG estimation and captures both local and global properties of an ARG. It is based on a data structure that we call the tree consistent 𝒫ℬ𝒲𝒯, a modification of the 𝒫ℬ𝒲𝒯 data structure introduced by R. Durbin (2014). We also explore some methods for estimating the quality of the generated tree topologies and for making inferences based on them. Finally, we discuss a probabilistic model that could potentially lead to the estimation of node times.