Paper information - Estimating the Probability of Approximate Matches

While considerable effort has been spent, and some progress made, on developing an analytic formula for the probability of an approximate match, this work has not yet come to fruition [4, 6, 2, 1]. We therefore consider an unbiased estimation procedure for determining this probability, given a specific string P ∈ Σ* and a specific cost function δ for weighting edit operations. Problems of this type are of general interest; see, for example, a recent paper [5] giving an unbiased estimator for counting the words of a fixed length in a regular language. We were further motivated by a particular application arising in Anrep, the pattern matching system we designed for use in genomic sequence analysis [8, 11]. Anrep searches for a complex pattern by backtracking over subprocedures that find approximate matches. The subpatterns are searched in an order that attempts to minimize the expected running time of the search. Determining this optimal backtrack order requires a reasonably accurate estimate of the probability of finding an approximate match to each subpattern. Given that the probabilities involved are frequently 10 or less, the simple expedient of measuring match frequency over a random text of several thousand characters has been less than satisfactory. The unbiased estimator presented here is shown to give good results within about a thousand samples, even for patterns of small probability. It is thus expected to improve the performance of Anrep, and it may be useful in estimating the significance of similarity searches. Proceeding formally, suppose we are given
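To make the baseline concrete, the following is a minimal sketch of the "simple expedient" the abstract calls unsatisfactory: measuring match frequency over a random text. It is not the unbiased estimator of the paper. The sketch assumes unit edit costs in place of a general cost function δ, a uniform random text over a four-letter alphabet, and takes "match probability" to mean the fraction of text positions at which some substring within edit distance k of the pattern ends. The function names, the threshold parameter `k`, and the defaults are all hypothetical.

```python
import random


def match_end_positions(pattern, text, k):
    """Return the 1-based text positions where some substring ending there
    is within unit-cost edit distance k of pattern (Sellers-style DP)."""
    m = len(pattern)
    # col[i] holds the best distance between pattern[:i] and any substring
    # of the text ending at the current position.
    col = list(range(m + 1))
    ends = []
    for j, c in enumerate(text, start=1):
        prev_diag = col[0]
        col[0] = 0  # a match may start at any text position
        for i in range(1, m + 1):
            old = col[i]
            col[i] = min(
                prev_diag + (pattern[i - 1] != c),  # substitute or match
                old + 1,                            # extra text character
                col[i - 1] + 1,                     # missing pattern character
            )
            prev_diag = old
        if col[m] <= k:
            ends.append(j)
    return ends


def naive_match_frequency(pattern, k, alphabet="ACGT", text_len=5000, seed=0):
    """Fraction of positions in one random text at which a match ends.
    This is the naive baseline, assumed here for illustration only."""
    rng = random.Random(seed)
    text = "".join(rng.choice(alphabet) for _ in range(text_len))
    return len(match_end_positions(pattern, text, k)) / text_len


if __name__ == "__main__":
    print(naive_match_frequency("ACGTACGTAC", k=1))
```

With a text of a few thousand characters, a pattern whose match probability is very small will typically produce zero observed matches, so this estimate is often exactly 0. This is the limitation that motivates an estimator that works with a modest number of samples.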

Eugene W. Myers | Stefan Kurtz

[1] Michael J. Fischer, et al. The String-to-String Correction Problem, 1974, JACM.

[2] Donald E. Knuth, et al. Fast Pattern Matching in Strings, 1977, SIAM J. Comput.

[3] M. O. Dayhoff. A model of evolutionary change in protein, 1978.

[4] M. O. Dayhoff, et al. 22 A Model of Evolutionary Change in Proteins, 1978.

[5] W. Fitch. Random sequences, 1983, Journal of molecular biology.

[6] Esko Ukkonen, et al. Algorithms for Approximate String Matching, 1985, Inf. Control.

[7] S. Karlin, et al. Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes, 1990, Proceedings of the National Academy of Sciences of the United States of America.

[8] Eugene W. Myers, et al. Approximate Matching of Network Expressions with Spacers, 1992, LATIN.

[9] Jordan Lampe, et al. Theoretical and Empirical Comparisons of Approximate String Matching Algorithms, 1992, CPM.

[10] G. Mehldau, et al. A system for pattern matching applications on biosequences, 1993, Comput. Appl. Biosci.

[11] Sampath Kannan, et al. Counting and random generation of strings in regular languages, 1995, SODA '95.