Efficient Online k-Best Lookup in Weighted Finite-State Cascades

Weighted finite-state transducers (WFSTs) have proved to be powerful and efficient tools for a variety of natural-language processing tasks, including automatic phonetization and phonological rule systems (Kaplan & Kay, 1994; Laporte, 1997), morphological analysis (Geyken & Hanneforth, 2006), and shallow syntactic parsing (Roche, 1997). In particular, cascades arising from the composition of two or more WFSTs can be used to model processing “pipelines”, each component of which is itself a (weighted) finite-state transducer. Typically, the input to such a pipeline is a simple string, so that running the pipeline amounts to a lookup operation for that string in the processing cascade. Unfortunately, exhaustive “offline” compilation of the processing cascade is in many cases infeasible, due to memory restrictions and the combinatorial properties of the composition operation itself. Even for simple lookup operations in “dense” cascades, the resulting WFST may be several times larger than the processing pipeline itself. In many such cases, particularly in optimization and error-correction problems, the output WFST serves only as an intermediate processing datum: we are not interested in an exhaustive representation of the lookup output, but only in a small finite subset of its language, such as the k-best paths. This paper presents a novel algorithm for efficient k-best search in a subclass of weighted finite-state lookup cascades. The algorithm avoids the combinatorial explosion associated with “dense” cascade relations by means of online computation: only those states and arcs required for a k-best search of the lookup output are constructed, and they are constructed dynamically. A greedy termination clause, together with an additional cutoff parameter, helps to ensure speedy completion and simultaneously prunes unwanted results from the output.
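The general idea of online k-best lookup can be illustrated with a minimal Python sketch. This is not the algorithm presented in the paper; it is a plain best-first (Dijkstra-style) path enumeration under simplifying assumptions: weights are non-negative costs in the tropical semiring, every arc consumes exactly one input symbol (no epsilon arcs), and each transducer is a toy pair `(arcs, finals)` of dictionaries. The names `step`, `k_best_lookup`, `EDITOR`, and `LEXICON` are hypothetical and chosen only for this illustration.

```python
import heapq
from itertools import count

def step(cascade, states, symbol):
    """Feed one symbol through every layer of the cascade, on demand.

    Returns a list of (output_symbol, weight, next_states) triples.
    Only arcs leaving the current state tuple are ever inspected.
    """
    results = [(symbol, 0.0, ())]
    for (arcs, _finals), q in zip(cascade, states):
        results = [
            (out, w + w2, nxt + (q2,))
            for (sym, w, nxt) in results
            for (out, w2, q2) in arcs.get(q, {}).get(sym, [])
        ]
    return results

def k_best_lookup(cascade, starts, word, k, cutoff=float("inf")):
    """Return up to k cheapest (cost, output) pairs for `word`.

    Stops as soon as k results are found (greedy termination) or the
    cheapest remaining hypothesis exceeds `cutoff`.
    """
    tie = count()  # unique tie-breaker so heap entries never compare further
    heap = [(0.0, next(tie), 0, tuple(starts), "", False)]
    found = []
    while heap and len(found) < k:
        cost, _, pos, states, out, complete = heapq.heappop(heap)
        if cost > cutoff:
            break  # min-heap: every remaining hypothesis is at least as costly
        if complete:
            found.append((cost, out))
        elif pos == len(word):
            # Input exhausted: accept only if every layer is in a final state.
            if all(q in finals for (_a, finals), q in zip(cascade, states)):
                final_w = sum(finals[q] for (_a, finals), q in zip(cascade, states))
                heapq.heappush(heap, (cost + final_w, next(tie), pos, states, out, True))
        else:
            for sym, w, nxt in step(cascade, states, word[pos]):
                heapq.heappush(heap, (cost + w, next(tie), pos + 1, nxt, out + sym, False))
    return found

# Toy cascade (hypothetical): a one-state substitution "editor" over {a, b}
# followed by an acceptor for the lexicon {"ab", "bb"}.
EDITOR = ({0: {"a": [("a", 0.0, 0), ("b", 1.0, 0)],
               "b": [("b", 0.0, 0), ("a", 1.0, 0)]}}, {0: 0.0})
LEXICON = ({0: {"a": [("a", 0.0, 1)], "b": [("b", 0.0, 2)]},
            1: {"b": [("b", 0.0, 3)]},
            2: {"b": [("b", 0.0, 3)]}}, {3: 0.0})

print(k_best_lookup([EDITOR, LEXICON], [0, 0], "aa", k=2))
# [(1.0, 'ab'), (2.0, 'bb')]
```

Because hypotheses are popped in order of non-decreasing cost, the first k complete paths popped are the k best, and the composed machine is never materialized. Passing `cutoff=1.5` in the example above would return only `[(1.0, 'ab')]`.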

[1] Michael J. Fischer et al. The String-to-String Correction Problem, 1974, JACM.

[2] Robert E. Tarjan et al. Fibonacci heaps and their uses in improved network optimization algorithms, 1984, JACM.

[3] Emmanuel Roche et al. Finite-State Language Processing, 1997.

[4] Thomas Hanneforth et al. TAGH: A Complete Morphology for German Based on Weighted Finite State Automata, 2005, FSMNLP.

[5] Kemal Oflazer et al. Spelling Correction in Agglutinative Languages, 1994, ANLP.

[6] Kemal Oflazer et al. Error-tolerant Finite-state Recognition with Applications to Morphological Analysis and Spelling Correction, 1995, CL.

[7] J. V. Leeuwen. Rational Transductions for Phonetic Conversion and Phonology, 1997.

[8] Thomas H. Cormen et al. Introduction to algorithms [2nd ed.], 2001.

[9] Cyril Allauzen et al. Linear-Space Computation of the Edit-Distance between a String and a Finite Automaton, 2009, ArXiv.

[10] Edsger W. Dijkstra et al. A note on two problems in connexion with graphs, 1959, Numerische Mathematik.

[11] Martin Kay et al. Regular Models of Phonological Rule Systems, 1994, CL.

[12] Vladimir I. Levenshtein et al. Binary codes capable of correcting deletions, insertions, and reversals, 1965.

[13] Mehryar Mohri et al. Semiring Frameworks and Algorithms for Shortest-Distance Problems, 2002, J. Autom. Lang. Comb.

[14] Imre Simon. The Nondeterministic Complexity of a Finite Automaton, 1987.

[15] Van Nostrand et al. Error Bounds for Convolutional Codes and an Asymptotically Optimum Decoding Algorithm, 1967.

[16] Fernando Pereira et al. Weighted Automata in Text and Speech Processing, 2005, ArXiv.

[17] Johan Schalkwyk et al. OpenFst: A General and Efficient Weighted Finite-State Transducer Library, 2007, CIAA.

[18] Yves Schabes et al. Parsing with Finite-State Transducers, 1997.

[19] Bryan Jurish. Finding canonical forms for historical German text, 2008, KONVENS.

[20] Zoltán Ésik et al. Equational Axioms for a Theory of Automata, 2004.