Breadth-first search strategies for trie-based syntactic pattern recognition

Dictionary-based syntactic pattern recognition (PR) of strings attempts to recognize a transmitted string X* by processing its noisy version, Y, without sequentially comparing Y with every element X in the finite (but possibly large) dictionary, H. The best estimate X+ of X* is defined as the element of H that minimizes the generalized Levenshtein distance (GLD) D(X, Y) between X and Y over all X ∈ H. The non-sequential PR computation of X+ involves a compact trie-based representation of H. In this paper, we show how this computation can be optimized by incorporating breadth-first search (BFS) schemes on the underlying graph structure. This heuristic emerges from the trie-based dynamic programming recursive equations, which can be implemented effectively using a new data structure, called the linked list of prefixes, that can be built separately or “on top of” the trie representation of H. The new scheme does not restrict the number of errors in Y to a small constant, as most of the available methods do. The main contribution is that the new approach can be used for GLDs with general costs, and not merely for 0/1 costs. It is also applicable when all possible correct candidates need to be known, and not just the best match. These are the cases in which “cutoffs” cannot be used in the depth-first search (DFS) trie-based technique (Shang and Merrett in IEEE Trans Knowl Data Eng 8(4):540–547, 1996). The new technique is compared with the DFS trie-based technique (Risvik in United States Patent 6377945 B1, 23 April 2002; Shang and Merrett in IEEE Trans Knowl Data Eng 8(4):540–547, 1996) using three benchmark dictionaries, large and small, with different errors. In each case, we demonstrate marked improvements of up to 21% in the number of operations needed, while maintaining the same accuracy. Further improvements can be obtained by introducing knowledge of the maximum number or percentage of errors in Y.
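To make the idea concrete, the following is a minimal sketch of a breadth-first traversal of a trie that carries one dynamic programming row per node, so that words sharing a prefix share the corresponding edit-distance computation. It is an illustration under stated assumptions, not the paper's implementation: it uses a plain queue rather than the linked list of prefixes, and the names `build_trie`, `bfs_best_match`, `sub`, `ins` and `dele` are hypothetical. The cost functions default to 0/1 costs but may be replaced by arbitrary non-negative costs.

```python
from collections import deque


def build_trie(words):
    """Build a nested-dict trie; a node's "word" is set where a word ends."""
    root = {"children": {}, "word": None}
    for w in words:
        node = root
        for ch in w:
            node = node["children"].setdefault(
                ch, {"children": {}, "word": None}
            )
        node["word"] = w
    return root


def bfs_best_match(words, y,
                   sub=lambda a, b: 0 if a == b else 1,  # substitution cost (assumed)
                   ins=lambda b: 1,                      # cost of inserting a symbol of y (assumed)
                   dele=lambda a: 1):                    # cost of deleting a dictionary symbol (assumed)
    """Return (best_word, distance) for the noisy string y.

    Each trie node is visited once, level by level. row[j] holds the
    distance between the prefix spelled by the node and y[:j].
    """
    root = build_trie(words)

    # Row for the empty prefix: y[:j] is produced by j insertions.
    first = [0]
    for ch in y:
        first.append(first[-1] + ins(ch))

    best_word, best_dist = None, float("inf")
    queue = deque([(root, first)])
    while queue:
        node, row = queue.popleft()
        if node["word"] is not None and row[-1] < best_dist:
            best_word, best_dist = node["word"], row[-1]
        for ch, child in node["children"].items():
            # Extend the parent's row by one dictionary symbol.
            new = [row[0] + dele(ch)]
            for j, yc in enumerate(y, start=1):
                new.append(min(
                    row[j - 1] + sub(ch, yc),  # substitute or match
                    row[j] + dele(ch),         # delete ch
                    new[j - 1] + ins(yc),      # insert yc
                ))
            queue.append((child, new))
    return best_word, best_dist
```

For example, `bfs_best_match(["cat", "cart", "dog"], "dag")` returns `("dog", 1)`. Because this sketch applies no cutoff, it works unchanged with general costs, and it could be adapted to collect every candidate within a given distance instead of only the best one.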

[1] Enrique Vidal, et al. Efficient Error-Correcting Viterbi Parsing, 1998, IEEE Trans. Pattern Anal. Mach. Intell.

[2] B. John Oommen. Recognition of Noisy Subsequences Using Constrained Edit Distances, 1987, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[3] Klaus U. Schulz, et al. Fast string correction with Levenshtein automata, 2002, International Journal on Document Analysis and Recognition.

[4] B. John Oommen, et al. An effective algorithm for string correction using generalized edit distances--I. Description of the algorithm and its optimality, 1981, Inf. Sci.

[5] M. W. Du, et al. An Approach to Designing Very Fast Approximate String Matching Algorithms, 1994, IEEE Trans. Knowl. Data Eng.

[6] Philippe Flajolet, et al. The analysis of hybrid trie structures, 1998, SODA '98.

[7] B. John Oommen, et al. Spelling correction using probabilistic methods, 1984, Pattern Recognit. Lett.

[8] Philipp Bucher, et al. A Sequence Similarity Search Algorithm Based on a Probabilistic Interpretation of an Alignment Scoring System, 1996, ISMB.

[9] Robert A. Wagner, et al. Order-n correction for regular languages, 1974, CACM.

[10] B. John Oommen, et al. A formal theory for optimal and information theoretic syntactic pattern recognition, 1998, Pattern Recognit.

[11] Tamotsu Kasai, et al. A Method for the Correction of Garbled Words Based on the Levenshtein Metric, 1976, IEEE Transactions on Computers.

[12] Ronald L. Rivest, et al. Introduction to Algorithms, 1990.

[13] G. David Forney Jr. The Viterbi algorithm, 1973.

[14] James L. Peterson, et al. Computer programs for detecting and correcting spelling errors, 1980, CACM.

[15] G. Z. Sun, et al. Grammatical Inference, 1998, Lecture Notes in Computer Science.

[16] Ghada Hany Badr, et al. Dictionary-Based Syntactic Pattern Recognition Using Tries, 2004, SSPR/SPR.

[17] B. John Oommen. Constrained string editing, 1986, Inf. Sci.

[18] Georgios C. Anagnostopoulos, et al. Structural and syntactic pattern recognition (SSPR 2008) and statistical techniques in pattern recognition (SPR 2008), 2008, ICPR 2008.

[19] B.J. Oommen, et al. Pattern recognition of strings with substitutions, insertions, deletions and generalized transpositions, 1997, Pattern Recognit.

[20] Vladimir I. Levenshtein, et al. Binary codes capable of correcting deletions, insertions, and reversals, 1965.

[21] Mike Paterson, et al. A Faster Algorithm Computing String Edit Distances, 1980, J. Comput. Syst. Sci.

[22] Andrew J. Viterbi. Error Bounds for Convolutional Codes and an Asymptotically Optimum Decoding Algorithm, 1967.

[23] Esko Ukkonen, et al. Algorithms for Approximate String Matching, 1985, Inf. Control.

[24] Robert Sedgewick, et al. Fast algorithms for sorting and searching strings, 1997, SODA '97.

[25] Michael J. Fischer, et al. The String-to-String Correction Problem, 1974, JACM.

[26] Mischa Schwartz, et al. Two extensions of the Viterbi algorithm, 1991, IEEE Trans. Inf. Theory.

[27] B. John Oommen, et al. Designing syntactic pattern classifiers using vector quantization and parametric string editing, 1999, IEEE Trans. Syst. Man Cybern. Part B.

[28] Horst Bunke, et al. Fast approximate matching of words against a dictionary, 1995, Computing.

[29] Gonzalo Navarro, et al. A guided tour to approximate string matching, 2001, CSUR.

[30] David Sankoff, et al. Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison, 1983.

[31] Ricardo A. Baeza-Yates, et al. Fast approximate string matching in a dictionary, 1998, Proceedings. String Processing and Information Retrieval: A South American Symposium (Cat. No.98EX207).

[32] Richard Cole, et al. Dictionary matching and indexing with errors and don't cares, 2004, STOC '04.

[33] Godfrey Dewey, et al. Relativ frequency of English speech sounds, 1923.

[34] János Csirik, et al. Parametric string edit distance and its application to pattern recognition, 1995, IEEE Trans. Syst. Man Cybern.

[35] Kemal Oflazer, et al. Error-tolerant Finite-state Recognition with Applications to Morphological Analysis and Spelling Correction, 1995, CL.

[36] Kai Shen, et al. Adaptive Algorithms for Cache-Efficient Trie Search, 1998, ALENEX.

[37] C. H. Chen, et al. Handbook of Pattern Recognition and Computer Vision, 1993.

[38] T. H. Merrett, et al. Tries for Approximate String Matching, 1996, IEEE Trans. Knowl. Data Eng.

[39] Rafael Llobet, et al. Stochastic error-correcting parsing for OCR post-processing, 2000, Proceedings 15th International Conference on Pattern Recognition. ICPR-2000.