Genome assembly from synthetic long read clouds

Motivation: Despite rapid progress in sequencing technology, assembling the genomes of new species de novo and reconstructing complex metagenomes remain major technological challenges. New synthetic long read (SLR) technologies promise significant advances towards these goals; however, their applicability is limited by high sequencing requirements and by the inability of current assembly paradigms to cope with combinations of short and long reads.

Results: Here, we introduce Architect, a new de novo scaffolder aimed at SLR technologies. Unlike previous assembly strategies, Architect does not require a costly subassembly step; instead, it assembles genomes directly from the SLR’s underlying short reads, which we refer to as read clouds. This enables a 4- to 20-fold reduction in sequencing requirements and a 5-fold increase in assembly contiguity on both genomic and metagenomic datasets, relative to state-of-the-art assembly strategies aimed directly at fully subassembled long reads.

Availability and Implementation: Our source code is freely available at https://github.com/kuleshov/architect.

Contact: kuleshov@stanford.edu
