Over-Generative Finite State Transducer N-Gram for Out-of-Vocabulary Word Recognition

Hybrid statistical grammars both at word and character levels can be used to perform open-vocabulary recognition. This is usually done by allowing the special symbol for unknown-word in the word-level grammar and dynamically replacing it by a (long) n-gramat character-level, as the full transducer does not fit in the memory of most current computers. We present a modification of a finite-state-transducer (fst) n-gram that enables the creation of a static transducer, i.e. when it is not possible to perform on-demand composition. By combining paths in the "LG" transducer (composition of lexicon and n-gram)making it over-generative with respect to the n-grams observed in the corpus, it is possible to reduce the number of actual occurrences of the character-level grammar, the resulting transducer fits the memory of practical machines. We evaluate this model for handwriting recognition using the RIMES and the IAM dabases. We study its effect on the vocabulary size and show that this model is competitive with state-of-the-art solutions.

[1]  Hermann Ney,et al.  Open vocabulary handwriting recognition using combined word-level and character-level language models , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[2]  Mehryar Mohri,et al.  Finite-State Transducers in Language and Speech Processing , 1997, CL.

[3]  Geoffrey Leech,et al.  The tagged LOB Corpus : user's manual , 1986 .

[4]  Daniel Povey,et al.  The Kaldi Speech Recognition Toolkit , 2011 .

[5]  Christopher Kermorvant,et al.  The A2iA French handwriting recognition system at the Rimes-ICDAR2011 competition , 2012, Electronic Imaging.

[6]  T. Munich,et al.  Offline Handwriting Recognition with Multidimensional Recurrent Neural Networks , 2008, NIPS.

[7]  Chafic Mokbel,et al.  Using the Web to Create Dynamic Dictionaries in Handwritten Out-of-Vocabulary Word Recognition , 2013, 2013 12th International Conference on Document Analysis and Recognition.

[8]  Simon Haykin,et al.  GradientBased Learning Applied to Document Recognition , 2001 .

[9]  Farès Menasri,et al.  The A 2 iA French handwriting recognition system at the Rimes-ICDAR 2011 competition , 2011 .

[10]  Horst Bunke,et al.  The IAM-database: an English sentence database for offline handwriting recognition , 2002, International Journal on Document Analysis and Recognition.

[11]  Haikal El Abed,et al.  ICDAR 2011 - French Handwriting Recognition Competition , 2011, 2011 International Conference on Document Analysis and Recognition.

[12]  Yoshua Bengio,et al.  Gradient-based learning applied to document recognition , 1998, Proc. IEEE.

[13]  Christopher Kermorvant,et al.  The A2iA Arabic Handwritten Text Recognition System at the Open HaRT2013 Evaluation , 2014, 2014 11th IAPR International Workshop on Document Analysis Systems.