Integrating Hi-C links with assembly graphs for chromosome-scale assembly

Long-read sequencing and novel long-range assays have revolutionized de novo genome assembly by automating the reconstruction of reference-quality genomes. In particular, Hi-C sequencing is becoming an economical method for generating chromosome-scale scaffolds. Despite its increasing popularity, few open-source tools are available for Hi-C scaffolding. Errors in Hi-C scaffolds, particularly inversions and fusions across chromosomes, remain more frequent than with alternative scaffolding technologies. We present a novel open-source Hi-C scaffolder that does not require an a priori estimate of chromosome number and minimizes errors by using an assembly graph to guide scaffolding. We demonstrate higher accuracy than state-of-the-art methods across a variety of Hi-C library preparations and input assembly sizes. The Python and C++ code for our method is openly available at https://github.com/machinegun/SALSA.
