Tries for Approximate String Matching

Tries offer text searches whose cost is independent of the size of the document being searched, and so are important for large documents requiring secondary storage. Approximate searches, in which the search pattern differs from the document by k substitutions, transpositions, insertions or deletions, have hitherto been carried out only at costs linear in the size of the document. We present a trie-based method whose cost is independent of document size. Our experiments show that this new method significantly outperforms the nearest competitor for k=0 and k=1, which are arguably the most important cases. The other methods, whose cost is linear in k, begin to catch up, for our small files, only at k=2. For larger files, complexity arguments indicate that tries will outperform the linear methods for larger values of k. The indexes combine suffixes and so are compact in storage. When the text itself does not need to be stored, as in a spelling checker, we even obtain negative overhead: 50% compression. We discuss a variety of applications and extensions, including best match (for spelling checkers), case insensitivity, and limited approximate regular expression matching.
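To make the idea of approximate search over a trie concrete, the following is a minimal sketch and not the method of the paper. It assumes a plain word trie rather than a suffix index, and it counts only substitutions, insertions and deletions (transpositions are left out). The names `build_trie` and `search` are hypothetical. One row of the edit-distance table is carried down each branch, and a branch is abandoned as soon as every entry in its row exceeds k, so the work depends on the shape of the trie near the pattern rather than on the total amount of text indexed.

```python
def build_trie(words):
    """Build a nested-dict trie; the key None marks the end of a stored word."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[None] = word
    return root


def search(root, pattern, k):
    """Return sorted (word, distance) pairs with edit distance <= k.

    Assumption: distance counts substitutions, insertions and deletions
    only; transpositions are not treated as a single edit here.
    """
    results = []
    m = len(pattern)

    def walk(node, row):
        # row[j] = edit distance between the path so far and pattern[:j]
        if None in node and row[m] <= k:
            results.append((node[None], row[m]))
        for ch, child in node.items():
            if ch is None:
                continue
            new_row = [row[0] + 1]
            for j in range(1, m + 1):
                cost = 0 if pattern[j - 1] == ch else 1
                new_row.append(min(new_row[j - 1] + 1,   # skip a pattern character
                                   row[j] + 1,           # skip a trie character
                                   row[j - 1] + cost))   # match or substitute
            # Prune: no extension of this path can come back within k.
            if min(new_row) <= k:
                walk(child, new_row)

    walk(root, list(range(m + 1)))
    return sorted(results)
```

With the words `trie`, `tree`, `tries`, `try` and `text` indexed, `search(trie, "trie", 1)` returns `[("tree", 1), ("trie", 0), ("tries", 1)]`, and with k=0 it reduces to an exact lookup.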
