De novo assembly of ultra-deep sequencing data

Life scientists and bioinformaticians have struggled with insufficient amounts of sequencing data since the beginning of Sanger sequencing in the 1970s. As a consequence, most of the de novo assembly methods that have been proposed are designed to deal with low-coverage sequencing and unbalanced depth of coverage. The situation is now about to change. The cost of sequencing has decreased so much that it is worth considering the possibility of having "as much sequencing data as we want". Once sequencing is cheap enough that scientists can choose their desired depth of coverage without worrying about cost, the following question arises: assuming today's sequencing error rate, does a higher depth of coverage necessarily lead to a better-quality assembly? In this study, we demonstrate for the first time that current state-of-the-art assemblers are unable to handle ultra-deep (i.e., 1,000-10,000x) depth of coverage. We then propose a new method to build high-quality assemblies from ultra-deep sequencing data. Our approach is based on "data slicing": we split a large dataset into "slices", then assemble each slice individually using an off-the-shelf assembler. Our tool then optimally merges the individual assemblies. Experimental results show that our method can significantly improve the quality of the assemblies when compared to the assemblies of the individual slices.
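The slicing step described above can be illustrated with a minimal Python sketch. The abstract does not say how slices are formed, so the random partition, the function names (`depth_of_coverage`, `slice_reads`), and the `slice_coverage` parameter are all assumptions made for illustration; the per-slice assembly and the merging step are not shown. Depth of coverage is computed as the total number of sequenced bases divided by the genome size.

```python
import random


def depth_of_coverage(num_reads, read_length, genome_size):
    """Average depth of coverage: total sequenced bases / genome size."""
    return num_reads * read_length / genome_size


def slice_reads(reads, genome_size, slice_coverage, seed=0):
    """Partition reads into slices of roughly `slice_coverage` depth each.

    Assumption: reads are assigned to slices by a random shuffle followed
    by round-robin assignment. This is a hypothetical scheme, not
    necessarily the one used by the authors.
    """
    total_bases = sum(len(read) for read in reads)
    total_coverage = total_bases / genome_size
    # At least one slice, even when the data is shallower than the target.
    num_slices = max(1, round(total_coverage / slice_coverage))
    shuffled = list(reads)
    random.Random(seed).shuffle(shuffled)
    # Round-robin assignment keeps slice sizes within one read of each other.
    return [shuffled[i::num_slices] for i in range(num_slices)]
```

Each returned slice would then be passed to an off-the-shelf assembler, and the resulting assemblies merged. Fixing the seed keeps the partition reproducible across runs.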
