Autonomously normalized horizontal differentials as features for HMM-based Omni font-written OCR systems for cursively scripted languages

Automatic font-written Optical Character Recognition (OCR) is highly desirable for numerous modern information technology (IT) applications. Reliable font-written OCR's for Latin scripts are readily in use since long. For cursively scripted languages, that are the mother tongues of over one fourth of the world population, such OCR's are however not available at a robust and reliable performance. In this regard, the main challenge is the mandatory connectivity of characters/ligatures (i.e. graphemes) that has to be resolved simultaneously upon the recognition of these graphemes. Among the various approaches tried over decades, Hidden Markov Models (HMM)-based OCR's seem to be the most promising as they capitalize on the ability of HMM decoders to achieve segmentation and recognition simultaneously similar to the widely used HMM-based automatic speech recognition (ASR). Unlike ASR's, what is missing in HMM-based OCR's is the definition of a rigorously founded features vector capable to robustly achieving minimal “font type/size-independent” (omnifont) word error rates comparable to those realized with Latin scripts. Here comes the contribution of this paper that introduces such a sound features vector design, and experimentally shows its superiority in this regard.

[1]  Sabri A. Mahmoud,et al.  Survey and bibliography of Arabic optical text recognition , 1995, Signal Process..

[2]  Horst Bunke,et al.  Off-line cursive handwriting recognition using hidden markov models , 1995, Pattern Recognit..

[3]  Robert M. Gray,et al.  An Algorithm for Vector Quantizer Design , 1980, IEEE Trans. Commun..

[4]  Shigeo Abe DrEng Pattern Classification , 2001, Springer London.

[5]  Slava M. Katz,et al.  Estimation of probabilities from sparse data for the language model component of a speech recognizer , 1987, IEEE Trans. Acoust. Speech Signal Process..

[6]  Mohammad S. Khorsheed,et al.  Offline recognition of omnifont Arabic text using the HMM ToolKit (HTK) , 2007, Pattern Recognit. Lett..

[7]  Chafic Mokbel,et al.  Building Annotated Written and Spoken Arabic LRs in NEMLAR Project , 2006, LREC.

[8]  Stephen E. Levinson,et al.  A speaker-independent, syntax-directed, connected word recognition system based on hidden Markov models and level building , 1985, IEEE Trans. Acoust. Speech Signal Process..

[9]  David G. Stork,et al.  Pattern Classification , 1973 .

[10]  Anil K. Jain,et al.  Feature extraction methods for character recognition-A survey , 1996, Pattern Recognit..

[11]  Mohamed S. El-Mahallawy,et al.  Histogram-Based Lines and Words Decomposition for Arabic Omni Font-Written OCR Systems; Enhancements and Evaluation , 2007, CAIP.

[12]  Richard M. Schwartz,et al.  An Omnifont Open-Vocabulary OCR System for English and Arabic , 1999, IEEE Trans. Pattern Anal. Mach. Intell..