An increasingly important problem in genome sequencing is the failure of commonly used shotgun assembly programs to correctly assemble repetitive sequences. Assembling non-repetitive regions, or regions containing repeats considerably shorter than the average read length, is easy in practice, whereas longer repeats have remained a difficult problem. Here we present a statistical method for separating arbitrarily long, almost identical repeats, which makes it possible to correctly assemble complex repetitive sequence regions. The differences between repeat units may be as low as 1%, and the sequencing error rate may be up to ten times higher. The method is based on the realization that comparing only a subset of the overlapping sequences in a data set at a time does not generate enough information for a conclusive analysis. Our method instead uses optimal multi-alignments consisting of all the overlaps of each read. This makes it possible to determine defined nucleotide positions (DNPs), which constitute the differences between the repeat units. Differences between repeats are distinguished from sequencing errors statistically: the probabilities of obtaining particular combinations of candidate DNPs are calculated using the information from the multi-alignments. The use of DNPs and combinations of DNPs will allow for optimal and rapid assemblies of repeated regions. The method can resolve repeats that differ in only two positions within a read length, which is the theoretical limit for repeat separation. We predict that this method will be highly useful in shotgun sequencing in the future.
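The idea of separating true repeat differences from sequencing errors by looking at combinations of candidate positions can be illustrated with a minimal sketch. It is not the statistic used in the method itself: it assumes a gap-free multi-alignment given as equal-length strings, independent sequencing errors with a single per-base rate `error_rate`, and a simple binomial tail as the test. The function names (`binom_tail`, `minority_reads`, `coincidence_p_value`) and the toy data are hypothetical.

```python
from collections import Counter
from math import comb


def binom_tail(n, k, p):
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def minority_reads(alignment, col):
    """Indices of reads whose base in column `col` differs from the column consensus."""
    column = [read[col] for read in alignment]
    consensus = Counter(column).most_common(1)[0][0]
    return {i for i, base in enumerate(column) if base != consensus}


def coincidence_p_value(alignment, col_a, col_b, error_rate):
    """Count reads deviating from consensus in BOTH columns and return that count
    with the probability of seeing at least that many under the (assumed) null
    model that every deviation is an independent sequencing error."""
    shared = minority_reads(alignment, col_a) & minority_reads(alignment, col_b)
    n = len(alignment)
    # Under the null model, one read deviates in both columns with probability error_rate**2.
    return len(shared), binom_tail(n, len(shared), error_rate**2)


# Toy multi-alignment: five reads from one repeat copy (one of them carrying a
# lone sequencing error in column 4) and three reads from a second copy that
# differs in columns 2 and 7.
alignment = (
    ["ACGTACGTAC"] * 4
    + ["ACGTCCGTAC"]
    + ["ACTTACGAAC"] * 3
)

count, p = coincidence_p_value(alignment, 2, 7, error_rate=0.1)
print(count, p)  # three coinciding reads; p is small even at a 10% error rate

count_err, p_err = coincidence_p_value(alignment, 2, 4, error_rate=0.1)
print(count_err, p_err)  # no coinciding reads, so no evidence of a repeat difference
```

In this toy case, the two columns that genuinely distinguish the repeat copies deviate together in the same three reads, which is very unlikely under independent errors, whereas the lone sequencing error in column 4 never coincides with another deviation.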