Improving draft genome contiguity with reference-derived in silico mate-pair libraries

Abstract

Background: Contiguous genome assemblies are a highly valued biological resource because more completely annotated genes and genomic elements are usable than in fragmented draft genomes. Nonetheless, contiguity is difficult to obtain when only low-coverage data and/or only distantly related reference genome assemblies are available.

Findings: To improve genome contiguity, we have developed Cross-Species Scaffolding, a new pipeline that imports long-range distance information directly into the de novo assembly process by constructing mate-pair libraries in silico.

Conclusions: We show that genome assembly metrics and gene prediction improve dramatically with our pipeline by assembling two primate genomes based solely on ∼30x coverage of shotgun sequencing data.
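The idea of constructing a mate-pair library in silico can be illustrated with a small sketch. The abstract does not describe the pipeline's internals, so everything below is an assumption made for illustration: the function names `reverse_complement` and `simulate_mate_pairs` are hypothetical, the input is taken to be a single reference-derived sequence in which uncovered positions are marked `N`, and read pairs are emitted in forward-reverse orientation. The sketch cuts fragments of a fixed insert size at regular intervals and keeps only the two end reads, which is how long-range distance information can be encoded as read pairs.

```python
from typing import Iterator, Tuple

# Translation table for complementing DNA bases (N stays N).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def simulate_mate_pairs(
    sequence: str, read_len: int, insert_size: int, step: int
) -> Iterator[Tuple[str, str]]:
    """Yield (read1, read2) pairs cut from the ends of fixed-size fragments.

    Assumptions (not taken from the paper): fragments start every `step`
    bases, read2 is reverse-complemented (forward-reverse orientation),
    and pairs with an N in either read are skipped.
    """
    if read_len <= 0 or step <= 0 or insert_size < 2 * read_len:
        raise ValueError("need read_len > 0, step > 0, insert_size >= 2 * read_len")
    for start in range(0, len(sequence) - insert_size + 1, step):
        fragment = sequence[start:start + insert_size]
        left = fragment[:read_len]
        right = fragment[-read_len:]
        # Skip pairs that touch positions without sequence information.
        if "N" in left.upper() or "N" in right.upper():
            continue
        yield left, reverse_complement(right)


if __name__ == "__main__":
    toy_reference = "AAAACCCCGGGGTTTT"
    for read1, read2 in simulate_mate_pairs(toy_reference, 4, 12, 4):
        print(read1, read2)
```

In practice one would presumably generate several libraries with different insert sizes and write them as FASTQ files for the assembler; those steps are left out here because the abstract does not specify them.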
