Paper Information - Statistical Machine Translation as a Language Model for Handwriting Recognition

When performing handwriting recognition on natural language text, the use of a word-level language model (LM) is known to significantly improve recognition accuracy. The most common type of language model, the n-gram model, decomposes sentences into short, overlapping chunks. In this paper, we propose a new type of language model which we use in addition to the standard n-gram LM. Our new model uses the likelihood score from a statistical machine translation system as a reranking feature. In general terms, we automatically translate each OCR hypothesis into another language, and then create a feature score based on how "difficult" it was to perform the translation. Intuitively, the difficulty of translation correlates with how well-formed the input sentence is. In an Arabic handwriting recognition task, we obtained a 0.4% absolute improvement in word error rate (WER) on top of a powerful 5-gram LM.
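The reranking idea can be illustrated with a minimal sketch. It assumes that each hypothesis in an n-best list carries three log-domain scores and that they are combined as a weighted sum; the abstract does not specify how the features are combined, so the combination rule, the field names, the weights, and all numbers below are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    text: str
    recognizer_score: float   # log-score from the handwriting recognizer (made-up value)
    ngram_logprob: float      # log-probability under the n-gram LM (made-up value)
    translation_score: float  # log-likelihood of the best translation (made-up value)


# Hypothetical feature weights; in a real system they would be tuned on held-out data.
WEIGHTS = {"recognizer_score": 1.0, "ngram_logprob": 0.5, "translation_score": 0.3}


def combined_score(hyp, weights=WEIGHTS):
    """Weighted sum of the feature scores (an assumed combination rule)."""
    return sum(w * getattr(hyp, name) for name, w in weights.items())


def rerank(nbest, weights=WEIGHTS):
    """Return the hypotheses sorted from best to worst combined score."""
    return sorted(nbest, key=lambda h: combined_score(h, weights), reverse=True)


nbest = [
    Hypothesis("hypothesis A", -10.0, -20.0, -40.0),  # hard to translate
    Hypothesis("hypothesis B", -10.5, -20.0, -30.0),  # easier to translate
]

# Same weights with the translation feature switched off, for comparison.
no_translation = dict(WEIGHTS, translation_score=0.0)

print(rerank(nbest, no_translation)[0].text)
print(rerank(nbest)[0].text)
```

In this toy list the recognizer slightly prefers hypothesis A and the n-gram LM cannot separate the two, so the translation score is what moves hypothesis B to the top. It shows only the mechanics of adding one more feature to a reranker, not the paper's actual feature definition.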
