FALCON-Phase: Integrating PacBio and Hi-C data for phased diploid genomes

De novo genome assembly of outbred diploid organisms remains a challenge in computational biology because similar haplotypes are difficult to resolve. FALCON-Unzip, a phased diploid genome assembler, separates PacBio long reads by haplotype during assembly. It outputs contiguous primary contigs, which are pseudohaplotypes containing both phased haplotype regions and collapsed haplotypes. The ability to phase depends on the density of heterozygous variants, the depth of coverage, and the read length. Consequently, phase information is lost wherever phase blocks are interrupted by regions of low heterozygosity, which results in phase switches. Here we present FALCON-Phase, a new method that resolves phase switches by reconstructing contig-length phase blocks using Hi-C short reads mapped to both homozygous regions and phase blocks. Such Hi-C data contain ultra-long-range phasing information (>1 Mb). The FALCON-Phase algorithm is highly accurate (>96%) when benchmarked against a pedigree-based truth set. The FALCON-Phase pipeline can also be extended to scaffolds to generate chromosome-scale phase blocks. The code is freely available (https://github.com/phasegenomics/FALCON-Phase) under a BSD and attribution license.
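The intuition behind using Hi-C links to join phase blocks can be illustrated with a toy sketch. It is not the FALCON-Phase algorithm, whose details are not given here; it only shows one simple way that link counts between haplotigs of neighbouring phase blocks could indicate whether two blocks are in the same phase. The function name `assign_phase`, the `links` dictionary layout, and the greedy neighbour-by-neighbour rule are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical input layout (an assumption for this sketch):
# links[(i, hi, j, hj)] = number of Hi-C read pairs joining
# haplotig hi (0 or 1) of phase block i to haplotig hj of phase block j.
LinkCounts = Dict[Tuple[int, int, int, int], int]


def assign_phase(n_blocks: int, links: LinkCounts) -> List[int]:
    """Greedily orient each phase block relative to the previous one.

    Returns one 0/1 value per block; blocks with the same value are
    treated as being in the same phase as block 0's haplotig 0.
    """
    if n_blocks <= 0:
        return []
    orientation = [0]  # block 0 defines the reference phase
    for j in range(1, n_blocks):
        i = j - 1
        # "cis" links support the current labelling (0-0 and 1-1).
        cis = links.get((i, 0, j, 0), 0) + links.get((i, 1, j, 1), 0)
        # "trans" links support a switched labelling (0-1 and 1-0).
        trans = links.get((i, 0, j, 1), 0) + links.get((i, 1, j, 0), 0)
        flip = 1 if trans > cis else 0
        orientation.append(orientation[i] ^ flip)
    return orientation


# Made-up counts: blocks 0 and 1 agree, block 2 is switched.
example_links: LinkCounts = {
    (0, 0, 1, 0): 12, (0, 1, 1, 1): 10, (0, 0, 1, 1): 1, (0, 1, 1, 0): 2,
    (1, 0, 2, 0): 2, (1, 1, 2, 1): 1, (1, 0, 2, 1): 11, (1, 1, 2, 0): 9,
}
print(assign_phase(3, example_links))
```

A greedy comparison of adjacent blocks keeps the sketch short, but it ignores links between non-adjacent blocks, which a real method would likely use because Hi-C contacts span long distances.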
