Enhancing image-based Arabic document translation using noisy channel correction model

An image-based document translation system consists of several components, among which OCR (Optical Character Recognition) plays an important role. However, existing OCR software is not robust against environmental variations. Furthermore, OCR errors often propagate into the translation component, causing poor end-to-end performance. In this paper, we propose an image-based document translation system that uses an error correction model to correct misrecognized words in the OCR output. We train our correction model on synthetic data with different fonts and sizes to simulate real-world situations. We further enhance the correction model with bigrams to improve the correction of word segmentation errors. Experimental results show substantial improvements in both word recognition accuracy and translation quality. For instance, in an experiment using the Arabic Transparent font, the BLEU score increases from 18.70 to 33.47 with the use of our noisy channel model.
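To make the noisy channel idea concrete, the following is a minimal sketch and not the system described in the paper. It picks the vocabulary word w that maximizes P(w | previous word) * P(observed | w). The assumptions are ours: the channel probability is approximated as error_rate raised to the Levenshtein distance, the bigram language model uses add-one smoothing, and the toy corpus uses Latin-script placeholder words instead of Arabic. The names `edit_distance`, `train`, `correct`, and `error_rate` are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete a character
                           cur[j - 1] + 1,              # insert a character
                           prev[j - 1] + (ca != cb)))   # substitute or match
        prev = cur
    return prev[-1]


def train(tokens):
    """Collect unigram and bigram counts from clean (error-free) text."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def correct(observed, prev_word, unigrams, bigrams, error_rate=0.1):
    """Return the vocabulary word with the highest noisy channel score.

    score(w) = log P(w | prev_word) + log P(observed | w)
    Assumption: P(observed | w) is approximated by error_rate ** distance.
    """
    vocab_size = len(unigrams)

    def score(w):
        # Add-one smoothed bigram probability of w following prev_word.
        lm = math.log((bigrams[(prev_word, w)] + 1)
                      / (unigrams[prev_word] + vocab_size))
        # Each edit operation multiplies the channel probability by error_rate.
        channel = edit_distance(observed, w) * math.log(error_rate)
        return lm + channel

    return max(unigrams, key=score)


# Toy usage with placeholder data (not from the paper).
unigrams, bigrams = train("the cat sat on the mat".split())
print(correct("cot", "the", unigrams, bigrams))
```

On this toy corpus the misrecognized token "cot" following "the" is mapped to "cat", since "cat" is one edit away and is a seen continuation of "the". A realistic model would estimate character confusion probabilities from aligned OCR output and clean text rather than use a single error rate, and would need additional handling for split or merged words.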
