A Preprocessor for Shotgun Assembly of Large Genomes

The whole-genome shotgun (WGS) assembly technique has been remarkably successful in efforts to determine the sequence of bases that make up a genome. WGS assembly begins with a large collection of short fragments that have been selected at random from a genome. The sequence of bases at each end of each fragment is determined, albeit imprecisely, resulting in a sequence of letters called a "read." Each letter in a read is assigned a quality value, which estimates the probability that a sequencing error occurred in determining that letter. Reads are typically cut off after about 500 letters, beyond which sequencing errors become pervasive. We report on a set of procedures that (1) correct most of the sequencing errors, (2) change quality values accordingly, and (3) produce a list of "overlaps," i.e., pairs of reads that plausibly come from overlapping parts of the genome. Our procedures, which we collectively call the "UMD Overlapper," can be run iteratively and can serve as a preprocessor for other assemblers. We tested the UMD Overlapper on Celera's Drosophila reads. When we substituted it for Celera's overlap procedures in the front end of their assembler, the assembler produced a significantly improved genome assembly.
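To make the two central notions concrete, the following minimal sketch illustrates a quality value as an error probability and an "overlap" as a pair of reads whose ends plausibly match. It is not the UMD Overlapper's method, which the abstract does not describe. It assumes the widely used Phred convention (error probability 10^(-q/10)) for quality values, uses a brute-force suffix-prefix comparison, and ignores reverse complements; the function names and the `min_len` and `max_mismatch_rate` parameters are hypothetical.

```python
from typing import List, Tuple


def error_probability(quality: int) -> float:
    """Convert a quality value to an error probability.

    Assumption: Phred-style scaling, p = 10^(-q/10). The abstract only says
    that a quality value estimates the probability of a sequencing error.
    """
    return 10 ** (-quality / 10)


def suffix_prefix_overlap(a: str, b: str, min_len: int = 4,
                          max_mismatch_rate: float = 0.0) -> int:
    """Length of the longest suffix of `a` that matches a prefix of `b`.

    A candidate is accepted when its mismatch count is at most
    max_mismatch_rate * length. Returns 0 if no candidate of at least
    `min_len` letters qualifies.
    """
    for length in range(min(len(a), len(b)), min_len - 1, -1):
        suffix, prefix = a[-length:], b[:length]
        mismatches = sum(1 for x, y in zip(suffix, prefix) if x != y)
        if mismatches <= max_mismatch_rate * length:
            return length
    return 0


def find_overlaps(reads: List[str], min_len: int = 4,
                  max_mismatch_rate: float = 0.0) -> List[Tuple[int, int, int]]:
    """Brute-force all ordered read pairs (i, j) and report (i, j, length)."""
    overlaps = []
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i == j:
                continue
            length = suffix_prefix_overlap(a, b, min_len, max_mismatch_rate)
            if length:
                overlaps.append((i, j, length))
    return overlaps


if __name__ == "__main__":
    toy_reads = ["ACGTACGGA", "CGGATTTC", "GGGGGGGG"]
    print(find_overlaps(toy_reads))
    print(error_probability(20))
```

The all-pairs comparison is quadratic in the number of reads, so it is only suitable for toy inputs; an overlapper for large genomes would need an indexing scheme to avoid comparing every pair.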
