A consistency-based consensus algorithm for de novo and reference-guided sequence assembly of short reads

MOTIVATION: Novel high-throughput sequencing technologies pose new algorithmic challenges in handling massive amounts of short-read, high-coverage data. A robust and versatile consensus tool is of particular interest for such data since a sound multi-read alignment is a prerequisite for variation analyses, accurate genome assemblies and insert sequencing.

RESULTS: A multi-read alignment algorithm for de novo or reference-guided genome assembly is presented. The program identifies segments shared by multiple reads and then aligns these segments using a consistency-enhanced alignment graph. On real de novo sequencing data obtained from the newly established NCBI Short Read Archive, the program performs similarly in quality to other comparable programs. On more challenging simulated datasets for insert sequencing and variation analyses, our program outperforms the other tools.

AVAILABILITY: The consensus program can be downloaded from http://www.seqan.de/projects/consensus.html. It can be used stand-alone or in conjunction with the Celera Assembler. Both application scenarios as well as the usage of the tool are described in the documentation.
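The abstract does not spell out how the alignment graph is "consistency-enhanced". As a rough illustration only, the following Python sketch shows one common form of consistency extension (a triplet extension in the style of T-Coffee): a match between two segments gains weight whenever both segments also match a common third segment. The function name `consistency_extend`, the `(read, segment_index)` node identifiers, and the use of `min` as the support score are assumptions made for this example and are not taken from the program described above.

```python
from collections import defaultdict
from itertools import combinations


def consistency_extend(edges):
    """Return edge weights after one round of triplet consistency extension.

    `edges` maps frozenset({u, v}) -> weight, where u and v are hypothetical
    segment identifiers of the form (read_name, segment_index).
    """
    # Build adjacency maps so the matches of each segment are easy to scan.
    neighbours = defaultdict(dict)
    for pair, weight in edges.items():
        u, v = tuple(pair)
        neighbours[u][v] = weight
        neighbours[v][u] = weight

    # Start from the original weights; support is always computed from the
    # original graph, so the result does not depend on iteration order.
    extended = dict(edges)
    for adjacent in neighbours.values():
        # Every pair of segments matched to the same intermediate segment
        # supports a direct match between the two.
        for a, b in combinations(sorted(adjacent), 2):
            if a[0] == b[0]:
                continue  # assumption: never match two segments of one read
            support = min(adjacent[a], adjacent[b])
            key = frozenset((a, b))
            extended[key] = extended.get(key, 0) + support
    return extended


if __name__ == "__main__":
    # Three reads, one shared segment each; r1-r3 has no direct match yet.
    matches = {
        frozenset({("r1", 0), ("r2", 0)}): 10,
        frozenset({("r2", 0), ("r3", 0)}): 6,
    }
    for pair, weight in consistency_extend(matches).items():
        print(sorted(pair), weight)
```

In this toy input the segments of reads `r1` and `r3` are never compared directly, yet they receive an edge because both match the segment of `r2`. This is the kind of transitive evidence a consistency step is meant to capture before a final alignment is computed from the graph.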
