Integrating read-based and population-based phasing for dense and accurate haplotyping of individual genomes

Abstract

Motivation: Reconstruction of haplotypes for human genomes is an important problem in medical and population genetics. Hi-C sequencing generates read pairs with long-range haplotype information that can be computationally assembled to generate chromosome-spanning haplotypes. However, the haplotypes have limited completeness and low accuracy. Haplotype information from population reference panels can potentially be used to improve the completeness and accuracy of Hi-C haplotyping.

Results: In this paper, we describe a likelihood-based method that integrates short-range haplotype information from a population reference panel of haplotypes with the long-range haplotype information present in sequence reads from methods such as Hi-C, in order to assemble dense and highly accurate haplotypes for individual genomes. Our method leverages a statistical phasing method and a maximum spanning tree algorithm to determine the optimal second-order approximation of the population-based haplotype likelihood for an individual genome. The population-based likelihood is encoded using pseudo-reads, which are then used as input, along with sequence reads, for haplotype assembly using an existing tool, HapCUT2. Using whole-genome Hi-C data for two human genomes (NA19240 and NA12878), we demonstrate that this integrated phasing method enables the phasing of 97–98% of variants, reduces the switch error rates by 3–6-fold, and outperforms an existing method for combining phase information from sequence reads with population-based phasing. On Strand-seq data for NA12878, our method improves the haplotype completeness from 71.4% to 94.6% and reduces the switch error rate 2-fold, demonstrating its utility for phasing using multiple sequencing technologies.

Availability and implementation: Code and datasets are available at https://github.com/vibansal/IntegratedPhasing.
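The maximum spanning tree step named in the Results can be illustrated with a small sketch. A second-order (tree-structured) approximation of a joint distribution over variant sites is classically obtained by weighting each pair of sites with their mutual information and taking a maximum spanning tree. The Python code below follows that classic construction on toy data. The abstract does not state the exact edge weights or inputs, so the use of empirical mutual information over sampled haplotypes, the function names, and the toy `samples` matrix are all assumptions made for illustration, not the implementation in the linked repository.

```python
from itertools import combinations
from math import log


def mutual_information(samples, i, j):
    """Empirical mutual information (in nats) between allele columns i and j."""
    n = len(samples)
    joint, marg_i, marg_j = {}, {}, {}
    for row in samples:
        a, b = row[i], row[j]
        joint[(a, b)] = joint.get((a, b), 0) + 1
        marg_i[a] = marg_i.get(a, 0) + 1
        marg_j[b] = marg_j.get(b, 0) + 1
    return sum(
        (c / n) * log(c * n / (marg_i[a] * marg_j[b]))
        for (a, b), c in joint.items()
    )


def maximum_spanning_tree(num_nodes, weighted_edges):
    """Kruskal's algorithm with union-find, visiting edges by decreasing weight."""
    parent = list(range(num_nodes))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    tree = []
    for weight, u, v in sorted(weighted_edges, reverse=True):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:  # edge joins two components, so keep it
            parent[root_u] = root_v
            tree.append((u, v, weight))
    return tree


def dependence_tree(samples):
    """Tree over variant sites that maximizes total pairwise mutual information."""
    num_sites = len(samples[0])
    edges = [
        (mutual_information(samples, i, j), i, j)
        for i, j in combinations(range(num_sites), 2)
    ]
    return maximum_spanning_tree(num_sites, edges)


# Hypothetical toy input: each row is one sampled haplotype over 4 variant
# sites (0 = reference allele, 1 = alternate allele). Sites 0 and 1 always
# agree, sites 2 and 3 always agree, and the two groups are weakly linked.
samples = [
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 1, 1],
    [1, 1, 1, 1],
    [1, 1, 1, 1],
    [1, 1, 0, 0],
]

tree = dependence_tree(samples)
for u, v, weight in tree:
    print(f"site {u} -- site {v}: weight {weight:.3f}")
```

On this toy input the tree keeps the two strongly linked pairs (sites 0 and 1, sites 2 and 3) and joins them with one of the weaker cross edges. In the setting the abstract describes, each retained edge would presumably correspond to a pair of variants whose relative phase is encoded as a pseudo-read for HapCUT2, but that mapping is not specified in the text above and is left out of the sketch.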
