Extended haplotype phasing of de novo genome assemblies with FALCON-Phase

ABSTRACT Haplotype-resolved genome assemblies are important for understanding how combinations of variants impact phenotypes. Such assemblies can be created in various ways, for example by sequencing tissues that contain single-haplotype (haploid) genomes or by co-sequencing the parental genomes, but these approaches are impractical in many situations. We present FALCON-Phase, which integrates long-read sequencing data and ultra-long-range Hi-C chromatin interaction data from a diploid individual to create high-quality, phased diploid genome assemblies. We evaluated the method on three datasets (human, cattle, and zebra finch) for which high-quality, fully haplotype-resolved assemblies were available for benchmarking. Phasing accuracy depended on the heterozygosity of the sequenced individual, with higher accuracy for cattle and zebra finch (>97%) than for human (82%). In addition, scaffolding with the same Hi-C chromatin contact data produced phased chromosome-scale scaffolds.
