Building Fragment Assembly String Graphs

We present a concept and formalism, the string graph, that represents all that is inferable about a DNA sequence from a collection of shotgun sequencing reads collected from it. We give time- and space-efficient algorithms for constructing a string graph given the collection of overlaps between the reads and, in particular, present a novel linear expected time algorithm for transitive reduction in this context. The result demonstrates that the decomposition of reads into k-mers employed in the de Bruijn graph approach of Pevzner et al. is not essential, and in fact creates both efficiency problems and unnecessary conceptual complexities. The current paper is the first in a series and presents the basic algorithm and preliminary results that demonstrate the efficiency and scalability of the method. The result is a step toward a next-generation whole genome shotgun assembler that will easily scale to mammalian genomes.
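To make the idea of transitive reduction on an overlap graph concrete, here is a minimal sketch in Python. It is not the linear expected time algorithm presented in the paper; it is a naive pass that drops any edge already implied by a two-edge path. The function name, the dictionary layout, and the edge weights (taken here to be overhang lengths) are assumptions made for illustration only.

```python
def reduce_transitive_edges(graph):
    """Return a copy of `graph` without edges implied by a two-edge path.

    `graph` maps each read to a dict {neighbor: weight}. This layout is
    an assumption for illustration, not the paper's data structure.
    """
    reduced = {v: dict(nbrs) for v, nbrs in graph.items()}
    for v, nbrs in graph.items():
        for w in nbrs:
            # Any x reachable as v -> w -> x makes a direct v -> x edge redundant.
            for x in graph.get(w, {}):
                if x != w and x in nbrs:
                    reduced[v].pop(x, None)
    return reduced


# Hypothetical overlaps among three reads: A overlaps B, B overlaps C,
# and A also overlaps C directly.
overlaps = {
    "A": {"B": 3, "C": 7},
    "B": {"C": 4},
    "C": {},
}
reduced = reduce_transitive_edges(overlaps)
```

In this toy input the edge from A to C is removed because the path through B already accounts for it, leaving a simple chain of reads. The naive pass examines every pair of consecutive edges, so it is meant only to show what is being computed, not how to compute it efficiently.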

[1]  Eugene W. Myers, et al.  A whole-genome assembly of Drosophila, 2000, Science.

[2]  J. Weber, et al.  Human whole-genome shotgun sequencing, 1997, Genome research.

[3]  Eugene W. Myers, et al.  Efficient q-Gram Filters for Finding All epsilon-Matches over a Given Length, 2005, RECOMB.

[4]  D. Mccormick.  Sequence the Human Genome, 1986, Bio/Technology.

[5]  Michael S. Waterman, et al.  A New Algorithm for DNA Sequence Assembly, 1995, J. Comput. Biol.

[6]  J. Mullikin, et al.  The phusion assembler, 2003, Genome research.

[7]  Eugene W. Myers, et al.  Combinatorial algorithms for DNA sequence assembly, 1995, Algorithmica.

[8]  Timothy B. Stockwell, et al.  The Sequence of the Human Genome, 2001, Science.

[9]  Stephen M. Mount, et al.  The genome sequence of Drosophila melanogaster, 2000, Science.

[10]  Paramvir S. Dehal, et al.  Whole-Genome Shotgun Assembly and Analysis of the Genome of Fugu rubripes, 2002, Science.

[11]  P. Pevzner, et al.  An Eulerian path approach to DNA fragment assembly, 2001, Proceedings of the National Academy of Sciences of the United States of America.

[12]  E. Mauceli, et al.  Whole-genome sequence assembly for mammalian genomes: Arachne 2, 2003, Genome research.

[13]  Eugene W. Myers, et al.  Toward Simplifying and Accurately Formulating Fragment Assembly, 1995, J. Comput. Biol.

[14]  Kim R. Rasmussen, et al.  Efficient q-Gram Filters for Finding All epsilon-Matches Over a Given Length, 2005.

[15]  E. Lander, et al.  Genomic mapping by fingerprinting random clones: a mathematical analysis, 1988, Genomics.

[16]  Eugene W. Myers, et al.  A Dataset Generator for Whole Genome Shotgun Sequencing, 1999, ISMB.

[17]  L. Hillier, et al.  PCAP: a whole-genome assembly program, 2003, Genome research.

[18]  Ravindra K. Ahuja, et al.  Network Flows: Theory, Algorithms, and Applications, 1993.