The fragment assembly string graph

We present a concept and formalism, the string graph, which represents all that is inferable about a DNA sequence from a collection of shotgun sequencing reads obtained from it. We give time- and space-efficient algorithms for constructing a string graph from the collection of overlaps between the reads and, in particular, present a novel linear expected-time algorithm for transitive reduction in this context. The result demonstrates that the decomposition of reads into k-mers employed in the earlier de Bruijn graph approach is not essential, and exposes the close connection of that approach to the unitig approach we developed at Celera. This paper is a preliminary report giving the basic algorithm and results that demonstrate the efficiency and scalability of the method. These ideas are being used to build a next-generation whole-genome assembler called BOA (Berkeley Open Assembler) that will easily scale to mammalian genomes.
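To make the role of transitive reduction concrete, the following is a minimal sketch in Python. It is a naive reduction, not the linear expected-time algorithm the paper presents. The function name `transitive_reduction` and the edge-list representation are assumptions made for illustration. The sketch also ignores overlap lengths and read orientation, which a real overlap graph would have to take into account.

```python
def transitive_reduction(edges):
    """Naive transitive reduction of a directed overlap graph.

    `edges` is a list of (source, target) pairs, one per overlap between
    two reads (a hypothetical representation chosen for this sketch).
    An edge (v, x) is dropped when some read w has edges (v, w) and (w, x),
    because the overlap v -> x can then be inferred from the other two.
    """
    # Build successor sets: succ[v] holds every read that v overlaps into.
    succ = {}
    for v, w in edges:
        succ.setdefault(v, set()).add(w)

    # Collect every edge that is implied by a two-step path.
    reducible = set()
    for v, targets in succ.items():
        for w in targets:
            if w == v:
                continue  # ignore self-loops as intermediates
            for x in succ.get(w, ()):
                if x != w and x in targets:
                    reducible.add((v, x))

    # Keep the original edge order, minus the implied edges.
    return [e for e in edges if e not in reducible]


# Three reads where A overlaps B, B overlaps C, and A also overlaps C.
overlaps = [("A", "B"), ("B", "C"), ("A", "C")]
reduced = transitive_reduction(overlaps)
print(reduced)
```

In this example the edge from A to C is removed because it is implied by the path through B, leaving only the edges A to B and B to C.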
