Paper information - Multiple filtration and approximate pattern matching

Multiple filtration and approximate pattern matching

Given a text of length n and a query of length q, we present an algorithm for finding all locations of m-tuples in the text and in the query that differ by at most k mismatches. This problem is motivated by the dot-matrix constructions for sequence comparison and optimal oligonucleotide probe selection routinely used in molecular biology. In the case q = m the problem coincides with the classical approximate string matching with k mismatches problem. We present a new approach to this problem based on multiple hashing, which may have advantages over some sophisticated and theoretically efficient methods that have been proposed. This paper describes a two-stage process. The first stage (multiple filtration) uses a new technique to preselect roughly similar m-tuples. The second stage compares these m-tuples using an accurate method. We demonstrate the advantages of multiple filtration in comparison with other techniques for approximate pattern matching.
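The two-stage idea in the abstract (a cheap filter followed by an exact comparison) can be illustrated with a minimal Python sketch. It does not reproduce the paper's multiple filtration technique: the filter here is a plain pigeonhole argument, swapped in for illustration only. If two m-tuples differ in at most k positions and each is cut into k + 1 blocks, at least one block must match exactly. The names `similar_tuples`, `block_bounds`, and `hamming_at_most` are hypothetical and do not come from the paper.

```python
from collections import defaultdict


def hamming_at_most(a, b, k):
    """Return True if equal-length strings a and b differ in at most k positions."""
    mismatches = 0
    for x, y in zip(a, b):
        if x != y:
            mismatches += 1
            if mismatches > k:
                return False
    return True


def block_bounds(m, k):
    """Split range(m) into k + 1 contiguous blocks of near-equal size."""
    parts = k + 1
    return [(b * m // parts, (b + 1) * m // parts) for b in range(parts)]


def similar_tuples(text, query, m, k):
    """Return sorted (i, j) with text[i:i+m] and query[j:j+m] within k mismatches.

    Assumption: pigeonhole filtration, not the paper's multiple filtration.
    """
    bounds = block_bounds(m, k)

    # Stage 1a: hash every block of every query m-tuple.
    index = defaultdict(list)
    for j in range(len(query) - m + 1):
        for b, (lo, hi) in enumerate(bounds):
            index[(b, query[j + lo:j + hi])].append(j)

    # Stage 1b: a text m-tuple is a candidate partner of a query m-tuple
    # if they share at least one identical block at the same offset.
    candidates = set()
    for i in range(len(text) - m + 1):
        for b, (lo, hi) in enumerate(bounds):
            for j in index.get((b, text[i + lo:i + hi]), ()):
                candidates.add((i, j))

    # Stage 2: verify each surviving pair by counting mismatches exactly.
    return sorted(
        (i, j)
        for i, j in candidates
        if hamming_at_most(text[i:i + m], query[j:j + m], k)
    )


print(similar_tuples("ACGTACGT", "ACGA", m=4, k=1))  # prints [(0, 0), (4, 0)]
```

The filter never discards a true match, so the verification stage only has to remove false positives. How many candidates survive the filter depends on the block length and the alphabet size, and this sketch makes no attempt to tune that trade-off.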

Pavel A. Pevzner | Michael S. Waterman

[1] Raffaele Giancarlo,et al. Parallel String Matching with k Mismatches , 1987, Theor. Comput. Sci..

[2] Esko Ukkonen,et al. Finding Approximate Patterns in Strings , 1985, J. Algorithms.

[3] Udi Manber,et al. Fast text searching: allowing errors , 1992, CACM.

[4] Ricardo A. Baeza-Yates,et al. Fast and Practical Approximate String Matching , 1996, Inf. Process. Lett..

[5] Raffaele Giancarlo,et al. Data structures and algorithms for approximate string matching , 1988, J. Complex..

[6] Gad M. Landau,et al. Efficient String Matching with k Mismatches , 1986, Theor. Comput. Sci..

[7] Richard M. Karp,et al. Efficient Randomized Pattern-Matching Algorithms , 1987, IBM J. Res. Dev..

[8] Uzi Vishkin,et al. Deterministic Sampling - A New Technique for Fast Pattern Matching , 1991, SIAM J. Comput..

[9] J. Maizel,et al. Enhanced graphic matrix analysis of nucleic acid and protein sequences. , 1981, Proceedings of the National Academy of Sciences of the United States of America.

[10] D. Lipman,et al. Rapid similarity searches of nucleic acid and protein data banks. , 1983, Proceedings of the National Academy of Sciences of the United States of America.

[11] Andrew Hume,et al. Fast string searching , 1991, USENIX Summer.

[12] Eugene W. Myers,et al. Computer program for the IBM personal computer which searches for approximate matches to short oligonucleotide sequences in long target DNA sequences , 1986, Nucleic Acids Res..

[13] Zvi Galil,et al. Time-Space-Optimal String Matching , 1983, J. Comput. Syst. Sci..

[14] A G Ivanov. RECOGNITION OF AN APPROXIMATE OCCURRENCE OF WORDS ON A TURING MACHINE IN REAL TIME , 1985 .

[15] William Feller,et al. An Introduction to Probability Theory and Its Applications , 1967 .

[16] Fabrizio Luccio,et al. Simple and Efficient String Matching with k Mismatches , 1989, Inf. Process. Lett..

[17] Udi Manber,et al. An Algorithm for Approximate Membership checking with Application to Password Security , 1994, Inf. Process. Lett..

[18] Esko Ukkonen,et al. Approximate String Matching with q-grams and Maximal Matches , 1992, Theor. Comput. Sci..

[19] D. R. McGregor,et al. Fast approximate string matching , 1988, Softw. Pract. Exp..

[20] D. Lipman,et al. Rapid and sensitive protein similarity searches. , 1985, Science.

[21] Esko Ukkonen,et al. Boyer-Moore Approach to Approximate String Matching (Extended Abstract) , 1990, SWAT.

[22] Zvi Galil,et al. An Improved Algorithm for Approximate String Matching , 1989, SIAM J. Comput..

[23] John Shawe-Taylor,et al. An Approximate String-Matching Algorithm , 1992, Theor. Comput. Sci..

[24] Malcolm C. Harrison,et al. Implementation of the substring test by hashing , 1971, CACM.

[25] Gad M. Landau,et al. Locating alignments with k differences for nucleotide and amino acid sequences , 1988, Comput. Appl. Biosci..

[26] J. Wrench. Table errata: The art of computer programming, Vol. 2: Seminumerical algorithms (Addison-Wesley, Reading, Mass., 1969) by Donald E. Knuth , 1970 .

[27] Donald E. Knuth,et al. Fast Pattern Matching in Strings , 1977, SIAM J. Comput..

[28] Eugene L. Lawler,et al. Approximate string matching in sublinear expected time , 1990, Proceedings [1990] 31st Annual Symposium on Foundations of Computer Science.

[29] David Haussler,et al. A new distance metric on strings computable in linear time , 1988, Discret. Appl. Math..

[30] Gaston H. Gonnet,et al. A new approach to text searching , 1992, CACM.

[31] B. Blaisdell. A measure of the similarity of sets of sequences not requiring sequence alignment. , 1986, Proceedings of the National Academy of Sciences of the United States of America.

[32] Gad M. Landau,et al. Fast Parallel and Serial Approximate String Matching , 1989, J. Algorithms.

[33] Philippe Dessen,et al. A computer program for the design of optimal synthetic oligonucleotide probes for protein coding genes , 1987, Comput. Appl. Biosci..

[34] William Feller,et al. An Introduction To Probability Theory And Its Applications , 1950 .

[35] Donald Ervin Knuth,et al. The Art of Computer Programming , 1968 .

[36] Gad M. Landau,et al. Efficient string matching in the presence of errors , 1985, 26th Annual Symposium on Foundations of Computer Science (sfcs 1985).

[37] Eugene W. Myers,et al. A sublinear algorithm for approximate keyword searching , 1994, Algorithmica.

[38] J. P. Dumas,et al. Efficient algorithms for folding and comparing nucleic acid sequences , 1982, Nucleic Acids Res..

[39] Zvi Galil,et al. Improved string matching with k mismatches , 1986, SIGACT News.

[40] Robert S. Boyer,et al. A fast string searching algorithm , 1977, CACM.