Multiple filtration and approximate pattern matching

Given a text of length n and a query of length q, we present an algorithm for finding all locations of m-tuples in the text and in the query that differ by at most k mismatches. This problem is motivated by the dot-matrix constructions for sequence comparison and optimal oligonucleotide probe selection routinely used in molecular biology. In the case q = m the problem coincides with the classical approximate string matching with k mismatches problem. We present a new approach to this problem based on multiple hashing, which may have advantages over some sophisticated and theoretically efficient methods that have been proposed. This paper describes a two-stage process. The first stage (multiple filtration) uses a new technique to preselect roughly similar m-tuples. The second stage compares these m-tuples using an accurate method. We demonstrate the advantages of multiple filtration in comparison with other techniques for approximate pattern matching.
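The two-stage structure described above (a cheap filter that preselects candidate pairs of m-tuples, followed by an exact comparison) can be illustrated with a minimal Python sketch. It does not reproduce the multiple filtration technique of the paper, whose details are not given in this abstract. As a stand-in, the filter uses the standard pigeonhole argument: if two m-tuples differ in at most k positions and are cut into k + 1 disjoint blocks, at least one block must match exactly. The function names `find_similar_tuples` and `hamming_at_most` and the choice of block length `m // (k + 1)` are assumptions made for illustration only.

```python
from collections import defaultdict


def hamming_at_most(a, b, k):
    """Return True if equal-length strings a and b differ in at most k positions."""
    mismatches = 0
    for x, y in zip(a, b):
        if x != y:
            mismatches += 1
            if mismatches > k:
                return False
    return True


def find_similar_tuples(text, query, m, k):
    """Return sorted pairs (i, j) such that text[i:i+m] and query[j:j+m]
    differ by at most k mismatches.

    Stage 1 (filter): pigeonhole filter on exact block matches. This is an
    illustrative stand-in, not the paper's multiple filtration.
    Stage 2 (verify): exact Hamming-distance check on each candidate pair.
    """
    block = m // (k + 1)  # assumed block length; requires m >= k + 1
    if block == 0:
        raise ValueError("m must be at least k + 1 for this filter")

    # Hash every block-length substring of the query to its start positions.
    index = defaultdict(list)
    for p in range(len(query) - block + 1):
        index[query[p:p + block]].append(p)

    # Stage 1: an exact block match at (t, p) may be block number b of a
    # similar pair of m-tuples starting at (t - b*block, p - b*block).
    candidates = set()
    for t in range(len(text) - block + 1):
        for p in index.get(text[t:t + block], ()):
            for b in range(k + 1):
                i, j = t - b * block, p - b * block
                if 0 <= i <= len(text) - m and 0 <= j <= len(query) - m:
                    candidates.add((i, j))

    # Stage 2: keep only the candidates that really are within k mismatches.
    return sorted(
        (i, j)
        for i, j in candidates
        if hamming_at_most(text[i:i + m], query[j:j + m], k)
    )
```

Calling `find_similar_tuples(text, query, m, k)` with q = m (a query that is itself a single m-tuple) reduces to the classical k-mismatch search. The filter never discards a true match, so the result agrees with a brute-force comparison of all pairs; how many false candidates reach the second stage depends on the block length and the alphabet, which is the cost that a stronger filter is meant to reduce.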
