De novo assembly of haplotype-resolved genomes with trio binning

Complex allelic variation hampers the assembly of haplotype-resolved sequences from diploid genomes. We developed trio binning, an approach that simplifies haplotype assembly by resolving allelic variation before assembly. In contrast with prior approaches, the effectiveness of our method improved with increasing heterozygosity. Trio binning uses short reads from two parental genomes to first partition long reads from an offspring into haplotype-specific sets. Each haplotype is then assembled independently, resulting in a complete diploid reconstruction. We used trio binning to recover both haplotypes of a diploid human genome and identified complex structural variants missed by alternative approaches. We sequenced an F1 cross between the cattle subspecies Bos taurus taurus and Bos taurus indicus and completely assembled both parental haplotypes with NG50 haplotig sizes of >20 Mb and 99.998% accuracy, surpassing the quality of current cattle reference genomes. We suggest that trio binning improves diploid genome assembly and will facilitate new studies of haplotype variation and inheritance.
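The partitioning step described above can be illustrated with a minimal sketch. The abstract does not say how parental short reads are used to separate offspring long reads, so this sketch assumes a k-mer approach: collect k-mers seen in only one parent, then assign each offspring read to the parent whose specific k-mers it shares more of. All function names, the toy sequences, and the choice of k are hypothetical and chosen for illustration only.

```python
# Illustrative sketch only: assumes k-mer based binning, which the abstract
# does not specify. Names and parameters are hypothetical.

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical_kmers(seq, k):
    """Return the set of canonical k-mers (min of k-mer and its reverse complement)."""
    seq = seq.upper()
    result = set()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        revcomp = kmer.translate(_COMPLEMENT)[::-1]
        result.add(min(kmer, revcomp))
    return result


def parent_specific_kmers(maternal_reads, paternal_reads, k):
    """Return (maternal-only, paternal-only) k-mer sets from parental short reads."""
    maternal = set().union(*(canonical_kmers(r, k) for r in maternal_reads))
    paternal = set().union(*(canonical_kmers(r, k) for r in paternal_reads))
    return maternal - paternal, paternal - maternal


def bin_reads(offspring_reads, maternal_only, paternal_only, k):
    """Partition offspring long reads by which parent's specific k-mers dominate."""
    bins = {"maternal": [], "paternal": [], "unassigned": []}
    for read in offspring_reads:
        kmers = canonical_kmers(read, k)
        maternal_hits = len(kmers & maternal_only)
        paternal_hits = len(kmers & paternal_only)
        if maternal_hits > paternal_hits:
            bins["maternal"].append(read)
        elif paternal_hits > maternal_hits:
            bins["paternal"].append(read)
        else:
            # No distinguishing k-mers, or a tie: leave the read unassigned.
            bins["unassigned"].append(read)
    return bins


if __name__ == "__main__":
    # Toy parental reads differing at a single base (G vs C).
    maternal_reads = ["ACGTACGGTTCA"]
    paternal_reads = ["ACGTACGCTTCA"]
    mat_only, pat_only = parent_specific_kmers(maternal_reads, paternal_reads, k=5)

    offspring = ["TACGGTTC", "TACGCTTC", "ACGTAC"]
    for label, reads in bin_reads(offspring, mat_only, pat_only, k=5).items():
        print(label, reads)
```

Each resulting bin would then be passed to a long-read assembler separately. In this sketch, reads without distinguishing k-mers go into their own bin; how such reads should be handled is a design choice that the abstract leaves open. The sketch also suggests why higher heterozygosity helps: more divergent parents yield more parent-specific k-mers, so fewer reads end up unassigned.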
