Automatic Identification of Errors in Arabic Handwriting Recognition

Arabic handwriting recognition (HR) is a challenging problem due to Arabic’s connected letter forms, consonantal diacritics and rich morphology. In this paper we isolate the task of identification of erroneous words in HR from the task of producing corrections for these words. We consider a variety of linguistic (morphological and syntactic) and non-linguistic features to automatically identify these errors. We also consider a learning curve varying in two dimensions: number of segments and number of n-best hypotheses to train on. We additionally evaluate the performance on different test sets with different degrees of errors in them. Our best approach achieves a roughly ∼20% absolute increase in F-score over a simple but reasonable baseline. A detailed error analysis shows that linguistic features, such as lemma models, help improve HR-error detection precisely where we expect them to: semantically inconsistent error words.

[1]  Rohit Prasad,et al.  Improvements in BBN's HMM-Based Offline Arabic Handwriting Recognition System , 2009, 2009 10th International Conference on Document Analysis and Recognition.

[2]  Günter Neumann,et al.  Arabic Computational Morphology: Knowledge-based and Empirical Methods , 2007 .

[3]  Nizar Habash,et al.  Online Arabic Handwriting Recognition Using Hidden Markov Models , 2006 .

[4]  Nizar Habash,et al.  Arabic Morphological Tagging, Diacritization, and Lemmatization Using Lexeme Models and Feature Ranking , 2008, ACL.

[5]  Nizar Habash,et al.  Arabic Tokenization, Part-of-Speech Tagging and Morphological Disambiguation in One Fell Swoop , 2005, ACL.

[6]  Walid Magdy,et al.  Arabic OCR Error Correction Using Character Segment Correction, Language Modeling, and Shallow Morphology , 2006, EMNLP.

[7]  Richard M. Schwartz,et al.  Robust language-independent OCR system , 1999, Other Conferences.

[8]  James H. Martin,et al.  Speech and language processing: an introduction to natural language processing, computational linguistics, and speech recognition, 2nd Edition , 2000, Prentice Hall series in artificial intelligence.

[9]  Nizar Habash,et al.  On Arabic Transliteration , 2007 .

[10]  Douglas W. Oard,et al.  Term selection for searching printed Arabic , 2002, SIGIR '02.

[11]  Stefan Jäger,et al.  Arabic and Chinese Handwriting Recognition - SACH 2006 Summit College Park, MD, USA, September 27-28, 2006 Selected Papers , 2008, SACH.

[12]  Andreas Stolcke,et al.  SRILM - an extensible language modeling toolkit , 2002, INTERSPEECH.

[13]  Kemal Oflazer,et al.  Error-tolerant Finite-state Recognition with Applications to Morphological Analysis and Spelling Correction , 1995, CL.

[14]  M. Sellami,et al.  MOrpho-LEXical analysis for correcting OCR-generated Arabic words (MOLEX) , 2002, Proceedings Eighth International Workshop on Frontiers in Handwriting Recognition.

[15]  Rickard Domeij,et al.  Detection of Spelling Errors in Swedish Not Using a Word List En Clair , 1994, J. Quant. Linguistics.

[16]  Bidyut Baran Chaudhuri,et al.  OCR Error Correction of an Inflectional Indian Language Using Morphological Parsing , 2000, J. Inf. Sci. Eng..