Velvet: algorithms for de novo short read assembly using de Bruijn graphs.

We have developed a new set of algorithms, collectively called "Velvet," to manipulate de Bruijn graphs for genomic sequence assembly. A de Bruijn graph is a compact representation based on short words (k-mers) that is ideal for high-coverage, very short read (25-50 bp) data sets. Using only very short reads and paired-end information, Velvet can produce contigs of significant length: up to 50-kb N50 in simulations of prokaryotic data and 3-kb N50 on simulated mammalian BACs. When applied to real Solexa data sets without read pairs, Velvet generated contigs of approximately 8 kb in a prokaryote and 2 kb in a mammalian BAC, in close agreement with our simulated results without read-pair information. Velvet represents a new approach to assembly that can leverage very short reads in combination with read pairs to produce useful assemblies.
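
To make the two central notions above concrete, the following is a minimal sketch of how a de Bruijn graph can be built from reads and how an N50 value is computed. It is not Velvet's implementation. It assumes the textbook formulation in which nodes are (k-1)-mers and each k-mer contributes one edge, and it ignores reverse complements, sequencing errors, and read pairs. The function names `build_de_bruijn` and `n50` are hypothetical and chosen for illustration only.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Build a simple de Bruijn graph from a list of reads.

    Assumption: nodes are (k-1)-mers and every k-mer seen in a read adds an
    edge from its prefix to its suffix. Reverse complements are not handled.
    """
    graph = defaultdict(set)
    for read in reads:
        # Slide a window of width k over the read; reads shorter than k
        # contribute nothing because the range is empty.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths.

    N50 is the largest length L such that contigs of length >= L together
    cover at least half of the total assembled length.
    """
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty input


if __name__ == "__main__":
    # Two overlapping toy reads, far shorter than real 25-50 bp data.
    graph = build_de_bruijn(["ACGTAC", "GTACGA"], k=4)
    for node in sorted(graph):
        print(node, "->", sorted(graph[node]))
    print("N50:", n50([2, 3, 4, 5, 6]))
```

Because overlapping reads share k-mers, they collapse onto the same nodes and edges, which is why the representation stays compact at high coverage. A node with more than one outgoing edge, such as `ACG` in the toy input, marks a point where the assembly path is ambiguous.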
