Weighting Finite-State Transductions With Neural Context

How should one apply deep learning to tasks such as morphological reinflection, which stochastically edit one string to get another? A recent approach to such sequence-to-sequence tasks is to compress the input string into a vector that is then used to generate the output string, using recurrent neural networks. In contrast, we propose to keep the traditional architecture, which uses a finite-state transducer to score all possible output strings, but to augment the scoring function with the help of recurrent networks. A stack of bidirectional LSTMs reads the input string from left-to-right and right-to-left, in order to summarize the input context in which a transducer arc is applied. We combine these learned features with the transducer to define a probability distribution over aligned output strings, in the form of a weighted finite-state automaton. This reduces hand-engineering of features, allows learned features to examine unbounded context in the input string, and still permits exact inference through dynamic programming. We illustrate our method on the tasks of morphological reinflection and lemmatization.
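
The sketch below is a minimal illustration of the idea, not the authors' actual model or code: a stacked bidirectional LSTM reads the input string, and its hidden state at each position supplies the contextual features that weight the arcs of a simple edit transducer (DELETE, SUBSTITUTE, INSERT, STOP). The probability of an output string is then the sum over all aligned edit paths, computed exactly by a dynamic program over the input-by-output lattice. All class, method, and variable names here are illustrative assumptions.

```python
# Hedged sketch, assuming a character-level edit transducer whose arc weights
# are locally normalized probabilities conditioned on BiLSTM context features.
import torch
import torch.nn as nn

class NeuralContextEditTransducer(nn.Module):
    def __init__(self, vocab_size, emb_dim=32, hidden_dim=64, num_layers=2):
        super().__init__()
        self.vocab_size = vocab_size
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # Stack of bidirectional LSTMs summarizing unbounded left and right context.
        self.bilstm = nn.LSTM(emb_dim, hidden_dim, num_layers=num_layers,
                              bidirectional=True, batch_first=True)
        # Arc scorer: one logit per edit operation, conditioned on the input position.
        # Operation layout: [DELETE] + [SUB(c) for c] + [INS(c) for c] + [STOP].
        self.num_ops = 1 + vocab_size + vocab_size + 1
        self.arc_scorer = nn.Linear(2 * hidden_dim, self.num_ops)

    def arc_log_probs(self, x):
        """Locally normalized arc log-probabilities at input positions 0..n."""
        n = x.size(0)
        feats, _ = self.bilstm(self.embed(x).unsqueeze(0))        # (1, n, 2H)
        # Zero feature vector for the "all input consumed" position (a simplification).
        end_feat = feats.new_zeros(1, 1, feats.size(-1))
        feats = torch.cat([feats, end_feat], dim=1)               # (1, n+1, 2H)
        logits = self.arc_scorer(feats).squeeze(0)                # (n+1, num_ops)
        # Mask: STOP is invalid while input remains; only INS/STOP once it is consumed.
        V = self.vocab_size
        mask = torch.zeros_like(logits, dtype=torch.bool)
        mask[:n, 1 + 2 * V] = True        # no STOP before the end of input
        mask[n, 0] = True                 # no DELETE at the end
        mask[n, 1:1 + V] = True           # no SUB at the end
        return torch.log_softmax(logits.masked_fill(mask, float('-inf')), dim=-1)

    def log_prob(self, x, y):
        """log p(y | x): exact forward dynamic program over the edit lattice."""
        n, m, V = x.size(0), y.size(0), self.vocab_size
        lp = self.arc_log_probs(x)
        neg_inf = torch.tensor(float('-inf'))
        alpha = [[neg_inf] * (m + 1) for _ in range(n + 1)]
        alpha[0][0] = torch.tensor(0.0)
        for i in range(n + 1):
            for j in range(m + 1):
                scores = []
                if i > 0:                                    # DELETE x[i-1]
                    scores.append(alpha[i - 1][j] + lp[i - 1, 0])
                if i > 0 and j > 0:                          # SUB x[i-1] -> y[j-1]
                    scores.append(alpha[i - 1][j - 1] + lp[i - 1, 1 + int(y[j - 1])])
                if j > 0:                                    # INS y[j-1] at position i
                    scores.append(alpha[i][j - 1] + lp[i, 1 + V + int(y[j - 1])])
                if scores:
                    alpha[i][j] = torch.logsumexp(torch.stack(scores), dim=0)
        return alpha[n][m] + lp[n, 1 + 2 * V]                # finish with STOP

# Hypothetical usage: score one reinflection pair and train by gradient descent.
model = NeuralContextEditTransducer(vocab_size=30)
x = torch.tensor([22, 0, 11, 10])        # character ids for "walk" (made up)
y = torch.tensor([22, 0, 11, 10, 4, 3])  # character ids for "walked" (made up)
loss = -model.log_prob(x, y)
```

The double loop in log_prob is the exact dynamic program alluded to in the abstract: it marginalizes over every monotone alignment between input and output, while the BiLSTM lets each arc's weight depend on unbounded input context.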
