N-gram language models for offline handwritten text recognition

This paper investigates the impact of bigram and trigram language models on the performance of a hidden Markov model (HMM) based offline recognition system for handwritten sentences. The language models are trained on the LOB corpus, supplemented by additional text sources, including sentences from other corpora and random sentences generated by a stochastic context-free grammar (SCFG). Experimental results are reported in terms of test-set perplexity and the performance of the corresponding recognition systems. The text recognition experiments use handwritten material from the IAM database.
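As a rough illustration of the test-set perplexity measure mentioned above, the following minimal sketch trains a bigram model on tokenized sentences and evaluates it on held-out sentences. The abstract does not specify a smoothing scheme, so add-one smoothing is used here purely for brevity; the function names (`train_bigram`, `bigram_prob`, `perplexity`) and the toy sentences are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def train_bigram(sentences):
    """Collect context and bigram counts from tokenized sentences.

    Each sentence is a list of words; "<s>" and "</s>" are added as
    boundary markers. The vocabulary holds every token that can be
    predicted, so it includes "</s>" but not "<s>".
    """
    context_counts, bigram_counts, vocab = Counter(), Counter(), set()
    for words in sentences:
        tokens = ["<s>"] + list(words) + ["</s>"]
        vocab.update(tokens[1:])
        for prev, cur in zip(tokens, tokens[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, cur)] += 1
    return context_counts, bigram_counts, vocab


def bigram_prob(prev, cur, context_counts, bigram_counts, vocab_size):
    """P(cur | prev) with add-one smoothing (an assumption of this sketch)."""
    return (bigram_counts[(prev, cur)] + 1) / (context_counts[prev] + vocab_size)


def perplexity(sentences, context_counts, bigram_counts, vocab_size):
    """Test-set perplexity: 2 ** (negative mean log2 probability per token)."""
    log_prob, n_tokens = 0.0, 0
    for words in sentences:
        tokens = ["<s>"] + list(words) + ["</s>"]
        for prev, cur in zip(tokens, tokens[1:]):
            p = bigram_prob(prev, cur, context_counts, bigram_counts, vocab_size)
            log_prob += math.log2(p)
            n_tokens += 1
    return 2 ** (-log_prob / n_tokens)


# Toy training data, invented for illustration only.
train = [["the", "cat", "sat"], ["the", "dog", "sat"]]
ctx, bi, vocab = train_bigram(train)

ppl_seen = perplexity([["the", "cat", "sat"]], ctx, bi, len(vocab))
ppl_unseen = perplexity([["sat", "the", "the"]], ctx, bi, len(vocab))
print(ppl_seen, ppl_unseen)
```

A lower perplexity means the model finds the test text less surprising, which is why the sentence matching the training data scores lower than the scrambled one. A trigram model follows the same pattern with two-word contexts, and a practical system would replace add-one smoothing with a better discounting and back-off scheme.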
