An Online Approach to Overlap Detection for DNA Fragment Assembly
We consider the problem of estimating the number of sequences overlapping with a certain length from a large dataset of sequences. Such a problem is central to several challenging bioinformatics applications like DNA fragment assembly. A naive approach to calculating this number is to perform an all-pairs comparison over the dataset. We propose an entirely different approach to tackle the problem by considering random samples from the dataset and performing a ripple-join-based comparison between two randomized copies of the dataset. One key feature of our approach is that at each step of the ripple join it produces an estimate of the eventual number of fragments having a certain overlap length, along with confidence bounds, in an online fashion. Since this estimate converges after a sufficient number of samples are considered, the algorithm can be terminated much earlier than the point where it exhausts processing the entire dataset. We present experimental results that indicate that our approach produces sufficiently accurate estimates in a fraction of the time required by a naive approach.
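The idea of a running estimate from a ripple join over two randomized copies of the dataset can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a simplified square ripple join (one new fragment from each copy per step), an exact suffix-prefix match as the overlap test, and a plain scaled count as the estimator. The confidence bounds mentioned in the abstract are omitted here. The names `has_overlap` and `ripple_join_estimates` are hypothetical.

```python
import random


def has_overlap(a, b, length):
    """Assumed overlap test: the last `length` characters of `a`
    equal the first `length` characters of `b`."""
    return len(a) >= length and len(b) >= length and a[-length:] == b[:length]


def ripple_join_estimates(fragments, length, seed=0):
    """Yield (step, estimate) pairs, where estimate is the running guess of
    how many ordered pairs of distinct fragments overlap by `length`."""
    n = len(fragments)
    rng = random.Random(seed)
    # Two independently shuffled copies of the dataset (as index lists).
    r_order = rng.sample(range(n), n)
    s_order = rng.sample(range(n), n)
    total_pairs = n * (n - 1)  # ordered pairs of distinct fragments
    hits = 0                   # overlapping pairs found so far
    seen = 0                   # distinct-fragment pairs compared so far
    for k in range(n):
        r_new, s_new = r_order[k], s_order[k]
        # Square ripple step: the new R item against every S item seen so far
        # (including the new one), then the new S item against older R items.
        new_pairs = [(r_new, s) for s in s_order[:k + 1]]
        new_pairs += [(r, s_new) for r in r_order[:k]]
        for i, j in new_pairs:
            if i == j:
                continue  # skip a fragment paired with itself
            seen += 1
            if has_overlap(fragments[i], fragments[j], length):
                hits += 1
        # Scale the observed hit rate up to the full set of pairs.
        estimate = hits * total_pairs / seen if seen else 0.0
        yield k + 1, estimate


fragments = ["ACGTAC", "TACGGA", "GGATTC", "TTCACG", "ACGTTT"]
for step, estimate in ripple_join_estimates(fragments, length=3):
    print(step, estimate)
```

Once every fragment from both copies has been consumed, all pairs have been compared and the estimate equals the exact all-pairs count. In practice a caller would stop consuming the generator early, for example once successive estimates stop changing by more than a chosen tolerance.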
[1] M. Adams, et al. Shotgun Sequencing of the Human Genome, 1998, Science.
[2] Peter J. Haas, et al. Ripple joins for online aggregation, 1999, SIGMOD '99.