Comparative analysis of algorithms for whole-genome assembly of pyrosequencing data

Next-generation sequencing technologies have fostered an unprecedented proliferation of high-throughput sequencing projects and a concomitant development of novel algorithms for the assembly of short reads. In this context, an important issue is the need for a careful assessment of the accuracy of the assembly process. Here, we review the efficiency of a panel of assemblers, specifically designed to handle data from the GS FLX 454 platform, on three bacterial data sets that differ in read coverage and repeat content. Our aim is to investigate their strengths and weaknesses in reconstructing the reference genomes. In our benchmarking, we assess the assemblers' performance by quantifying and characterizing assembly gaps and errors, and by evaluating their ability to resolve complex genomic regions containing repeats. The ultimate goal of this analysis is to highlight the pros and cons of each method, in order to provide the end user with general criteria for choosing the appropriate assembly strategy for their specific needs. We have also explored the relationship between the coverage of a sequencing project and the quality of the results obtained. Our findings suggest that, for a good tradeoff between costs and results, the planned genome coverage of an experiment should not exceed 20-30×.
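To make the coverage figure concrete, the following is a minimal sketch of the standard fold-coverage arithmetic (total sequenced bases divided by genome size). The function names, the 2 Mb genome size, and the 400 bp mean read length are illustrative assumptions, not values taken from the study.

```python
import math


def coverage(num_reads: int, mean_read_length: float, genome_size: int) -> float:
    """Average fold coverage: total sequenced bases divided by genome size."""
    return num_reads * mean_read_length / genome_size


def reads_needed(target_coverage: float, mean_read_length: float, genome_size: int) -> int:
    """Smallest number of reads that reaches the target average coverage."""
    return math.ceil(target_coverage * genome_size / mean_read_length)


if __name__ == "__main__":
    # Hypothetical bacterial genome and read length, chosen only for illustration.
    genome_size = 2_000_000   # bp (assumed)
    mean_read_length = 400    # bp (assumed)

    for target in (20, 30):
        n = reads_needed(target, mean_read_length, genome_size)
        print(f"{target}x coverage needs about {n} reads "
              f"(check: {coverage(n, mean_read_length, genome_size):.1f}x)")
```

Under these assumed values, the 20-30× range corresponds to a read count that scales linearly with genome size and inversely with read length, which is why planned coverage translates directly into sequencing cost.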
