De novo meta-assembly of ultra-deep sequencing data

We introduce a new divide-and-conquer approach to the problem of de novo genome assembly in the presence of ultra-deep sequencing data (i.e. coverage of 1000x or higher). Our proposed meta-assembler, Slicembler, partitions the input data into optimally sized ‘slices’ and uses a standard assembly tool (e.g. Velvet, SPAdes, IDBA_UD or Ray) to assemble each slice individually. Slicembler then uses majority voting among the individual assemblies to identify long contigs that can be merged into the consensus assembly. To improve efficiency, Slicembler uses a generalized suffix tree to identify these frequent contigs (or fractions thereof). Extensive experimental results on real ultra-deep sequencing data (8000x coverage) and on simulated data show that Slicembler significantly improves assembly quality compared with the base assembler alone. In fact, most of the time Slicembler generates error-free assemblies. We also show that Slicembler is much more robust to high sequencing error rates than the base assembler.

Availability and implementation: Slicembler can be accessed at http://slicembler.cs.ucr.edu/.

Contact: hamid.mirebrahim@email.ucr.edu
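The slicing and majority-voting steps described in the abstract can be illustrated with a minimal Python sketch. It is not Slicembler's implementation: the function names `slice_reads` and `longest_majority_substring` are hypothetical, the per-slice assembly step (Velvet, SPAdes, etc.) is omitted and assumed to have already produced one list of contigs per slice, reverse complements are ignored, and a brute-force substring count stands in for the generalized suffix tree, so it is only practical on toy inputs.

```python
from collections import Counter
from typing import Iterable, List, Optional, Sequence, Set


def slice_reads(reads: Sequence[str], slice_size: int) -> List[List[str]]:
    """Partition reads into consecutive slices of at most slice_size reads.

    Assumption: a fixed slice size; choosing an optimal size is not modeled here.
    """
    if slice_size <= 0:
        raise ValueError("slice_size must be positive")
    return [list(reads[i:i + slice_size]) for i in range(0, len(reads), slice_size)]


def substrings_of_length(contigs: Iterable[str], length: int) -> Set[str]:
    """Collect every distinct substring of the given length from a set of contigs."""
    found: Set[str] = set()
    for contig in contigs:
        for start in range(len(contig) - length + 1):
            found.add(contig[start:start + length])
    return found


def longest_majority_substring(
    assemblies: Sequence[Sequence[str]], min_len: int = 1
) -> Optional[str]:
    """Return the longest string present in more than half of the assemblies.

    Each assembly is a list of contigs. Brute force replaces the generalized
    suffix tree mentioned in the abstract; ties are broken lexicographically.
    """
    if not assemblies:
        return None
    threshold = len(assemblies) // 2 + 1  # strict majority
    longest = max((len(c) for contigs in assemblies for c in contigs), default=0)
    for length in range(longest, min_len - 1, -1):
        support: Counter = Counter()
        for contigs in assemblies:
            # Each assembly votes at most once per candidate substring.
            support.update(substrings_of_length(contigs, length))
        winners = [s for s, votes in support.items() if votes >= threshold]
        if winners:
            return min(winners)
    return None


if __name__ == "__main__":
    # Toy contigs standing in for the output of a base assembler on four slices.
    toy_assemblies = [
        ["ACGTACGGA"],
        ["TTACGTACGG"],
        ["ACGTACGGTT", "CCC"],
        ["GGGG"],
    ]
    print(slice_reads(["r1", "r2", "r3", "r4", "r5"], 2))
    print(longest_majority_substring(toy_assemblies, min_len=4))
```

In this toy run, three of the four assemblies agree on a shared stretch, so it passes the majority vote even though one slice produced an unrelated contig. A real implementation would iterate, merging each accepted contig into the consensus and removing the reads it explains.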
