A haplotype-aware de novo assembly of related individuals using pedigree graph

Motivation: Reconstructing high-quality haplotype-resolved assemblies for related individuals of various species has important applications in understanding Mendelian diseases and in evolutionary and comparative genomics. Through major genomics sequencing efforts such as the Personal Genome Project, the Vertebrate Genome Project (VGP), the Earth Biogenome Project (EBP) and the Genome in a Bottle (GIAB) project, a variety of sequencing datasets from mother-father-child trios of various diploid species are becoming available. Current trio assembly approaches are not designed to incorporate long-read sequencing data from the parents in a trio, and therefore require relatively high coverage of costly long-read data to produce high-quality assemblies. Thus, a trio-aware assembler that produces accurate, chromosomal-scale diploid genomes in a pedigree while keeping sequencing costs low is a pressing need of the genomics community.

Results: We present a novel pedigree-graph-based approach to diploid assembly using accurate Illumina data and long-read Pacific Biosciences (PacBio) data from all related individuals, thereby generalizing our previous work on single individuals. We demonstrate the effectiveness of our pedigree approach on a simulated trio of pseudo-diploid yeast genomes with different heterozygosity rates, and on real data from Arabidopsis thaliana. We show that as little as 30× coverage of Illumina data and 15× coverage of PacBio data from each individual in a trio suffices to generate chromosomal-scale phased assemblies. Additionally, we show that we can detect and phase variants from the generated phased assemblies.

Availability: https://github.com/shilpagarg/WHdenovo

Contact: shilpa_garg@hms.harvard.edu, gchurch@genetics.med.harvard.edu
