Genome Sequence Assembly Using Trace Signals and Additional Sequence Information

Motivation: This article presents a method for assembling shotgun sequences that relies primarily on high confidence regions while taking advantage of additional available information such as low confidence regions, quality values or repetitive region tags. Conflict situations are resolved with routines for analysing trace signals.

Results: Initial tests with different human and mouse genome projects showed promising results but also demonstrated the need to recognise and correctly handle very long, untagged and nonstandard repeats.

Availability: Current versions of the MIRA assembler are available on request at the canonical project homepage as binaries for SGI, Intel Linux and SUN Solaris: http://www.dkfz-heidelberg.de/mbp-ased/

Contact: b.chevreux@dkfz-heidelberg.de (to whom correspondence should be addressed)

Introduction

Today's large scale genome sequencing efforts produce enormous quantities of data each day. Nearly all of them are based, in one way or another, on the chain-termination dideoxy method published by Sanger et al. (1977). However, the gel or capillary electrophoresis used can resolve at most about 1000 to 1500 bases per reading, and the high quality stretch with low error probabilities for the called bases often covers only the first 400 to 500 bases.

Current sequencing strategies for a contiguous DNA sequence (contig), ranging anywhere between 20 kilobases (kb) and 200 kb, therefore essentially consist of fragmenting the given contig into hundreds or thousands of overlapping subclones (Durbin and Dear (1998)), analysing these by electrophoresis and subsequently assembling the subclones back together into one contig. The extensively studied reconstruction of the unknown, correct contiguous DNA sequence by inferring it from a number of representations (also called fragments, see Myers (1995)) is called the assembly problem. The devil is in the details, however.

If the collected readings (reads) were 100% error free, a multitude of problems would not occur. In reality, the extraction of data by gel electrophoresis is a physical process in which errors due to chemical artifacts such as compressions show up quite often. Ewing et al. (1998) and Ewing and Green (1998) show that, together with errors occurring in the subsequent signal analysis, current laboratory technologies yield an overall error rate that may lie anywhere between 0.1% for good parts in the middle of a read and more than 10% in bad parts at the very beginning and at the end of a read. This error rate, combined with the aggravating fact that DNA tends to contain highly repetitive stretches with only very few bases differing between repeat locations, severely impedes the assembly process.

These error rates and the repetitive properties of DNA make it necessary to use fault tolerant algorithms that are able to explore alternatives. Wang and Jiang (1994) showed that the assembly problem is NP-complete, even when error free representations (fragments) of the true sequence are used. In practice, this means that data volumes of this size can only be assembled with approximating strategies that rely on algorithms which are well-behaved in time and space complexity.
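To make the error rates quoted above easier to relate to the quality values mentioned in the abstract, the following minimal sketch converts between the two. It assumes the phred-style definition Q = -10 log10(p) used by the base caller described in Ewing and Green (1998); the function names and the example read profile are hypothetical and are not taken from the MIRA assembler.

```python
import math


def error_prob_to_quality(p: float) -> float:
    """Phred-style quality value for a base-call error probability p.

    Assumes the definition Q = -10 * log10(p), with 0 < p <= 1.
    """
    if not 0.0 < p <= 1.0:
        raise ValueError("error probability must be in (0, 1]")
    return -10.0 * math.log10(p)


def quality_to_error_prob(q: float) -> float:
    """Inverse mapping: error probability for a phred-style quality q."""
    return 10.0 ** (-q / 10.0)


def expected_errors(qualities) -> float:
    """Expected number of miscalled bases in a read, given per-base qualities."""
    return sum(quality_to_error_prob(q) for q in qualities)


if __name__ == "__main__":
    # The two error rates quoted in the text: 0.1% and 10%.
    print(error_prob_to_quality(0.001))
    print(error_prob_to_quality(0.10))

    # Hypothetical read: 400 good bases followed by 100 poor bases.
    read_qualities = [30.0] * 400 + [10.0] * 100
    print(round(expected_errors(read_qualities), 2))
```

Under this assumed definition, an error rate of 0.1% corresponds to a quality value of about 30 and an error rate of 10% to a quality value of about 10. In the hypothetical read above, almost all of the expected errors come from the short low quality tail, which is one way to see why an assembler might prefer to rely primarily on high confidence regions.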

[1] J. Bonfield et al. A new DNA sequence assembly program. Nucleic Acids Research, 1995.

[2] Dan Gusfield et al. Algorithms on Strings, Trees, and Sequences - Computer Science and Computational Biology. 1997.

[3] R. Staden et al. The Staden sequence analysis package. Molecular Biotechnology, 1996.

[4] F. Sanger et al. DNA sequencing with chain-terminating inhibitors. Proceedings of the National Academy of Sciences of the United States of America, 1977.

[5] Eugene W. Myers et al. Toward Simplifying and Accurately Formulating Fragment Assembly. J. Comput. Biol., 1995.

[6] Hans Söderlund et al. SEQAID: a DNA sequence assembling program based on a mathematical model. Nucleic Acids Res., 1984.

[7] R. Durbin et al. Sequence assembly with CAFTOOLS. Genome Research, 1998.

[8] Gaston H. Gonnet et al. A new approach to text searching. CACM, 1992.

[9] D. K. Y. Chiu et al. A survey of multiple sequence comparison methods. 1992.

[10] X. Huang et al. An improved sequence assembly program. Genomics, 1996.

[11] Thomas Wetter et al. Computer Assisted Editing of Genomic Sequences - Why and How We Evaluated a Prototype. XPS, 1999.

[12] D. Higgins et al. SAGA: sequence alignment by genetic algorithm. Nucleic Acids Research, 1996.

[13] J. Bonfield et al. Experiment files and their application during large-scale sequencing projects. DNA Sequence: The Journal of DNA Sequencing and Mapping, 1996.

[14] Andrew K. C. Wong et al. A genetic algorithm for multiple molecular sequence alignment. Comput. Appl. Biosci., 1997.

[15] Udi Manber et al. Fast text searching: allowing errors. CACM, 1992.

[16] Tao Jiang et al. On the Complexity of Multiple Sequence Alignment. J. Comput. Biol., 1994.

[17] P. Green et al. Base-calling of automated sequencer traces using phred. I. Accuracy assessment. Genome Research, 1998.

[18] Jude W. Shavlik et al. Improving the Quality of Automatic DNA Sequence Assembly Using Fluorescent Trace-Data Classifications. ISMB, 1996.

[19] Dan Gusfield. Algorithms on Strings, Trees, and Sequences - Computer Science and Computational Biology. 1997.

[20] P. Green et al. Base-calling of automated sequencer traces using phred. II. Error probabilities. Genome Research, 1998.

[21] X. Huang et al. On global sequence alignment. Comput. Appl. Biosci., 1994.

[22] A. Dress et al. Multiple DNA and protein sequence alignment based on segment-to-segment comparison. Proceedings of the National Academy of Sciences of the United States of America, 1996.

[23] J. Stoye. Multiple sequence alignment with the Divide-and-Conquer method. Gene, 1998.

[24] Stephanie Forrest et al. Genetic Algorithms for DNA Sequence Assembly. ISMB, 1993.

[25] R. Durbin et al. Base qualities help sequencing software. Genome Research, 1998.