Impact of Character Models Choice on Arabic Text Recognition Performance

In this paper we analyze the impact of sub-model choice on automatic recognition of printed Arabic text based on Hidden Markov Models (HMMs). In our approach, sub-models correspond to character shapes, which are assembled to compose word models. One peculiarity of Arabic writing is that a character takes different shapes according to its position in the word: the 28 basic characters yield over 120 different shapes. Ideally, there would be one sub-model for each shape. However, some shapes are less frequent than others and, because training databases are finite, the learning process produces less reliable models for the infrequent shapes. We show that an optimal set of models must therefore be found as a trade-off between having more models, which capture the intricacies of individual shapes, and grouping the models of similar shapes with each other. We propose different sets of sub-models and evaluate them on the Arabic Printed Text Image (APTI) database, which is freely available to the scientific community.
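The trade-off described above can be illustrated with a minimal sketch. It is not the grouping procedure used in the paper: it assumes, purely for illustration, that shapes are keyed by a (character, position) pair and that shapes of the same character seen fewer than `min_count` times in training share one pooled sub-model. The function name, the threshold, the model naming scheme, and the toy counts are all hypothetical.

```python
def group_rare_shapes(shape_counts, min_count):
    """Map each (character, position) shape to a sub-model name.

    Shapes seen at least `min_count` times keep a dedicated model;
    rarer shapes of the same character are pooled into one shared model.
    This frequency rule is an assumption, not the paper's method.
    """
    mapping = {}
    for (char, position), count in shape_counts.items():
        if count >= min_count:
            mapping[(char, position)] = f"{char}_{position}"
        else:
            mapping[(char, position)] = f"{char}_shared"
    return mapping


# Toy counts invented for illustration only (not APTI statistics).
toy_counts = {
    ("beh", "isolated"): 12,
    ("beh", "initial"): 540,
    ("beh", "medial"): 610,
    ("beh", "final"): 35,
}

mapping = group_rare_shapes(toy_counts, min_count=100)
n_models = len(set(mapping.values()))  # number of distinct sub-models to train
```

Raising `min_count` yields fewer sub-models, each trained on more samples; lowering it yields more shape-specific models, each trained on fewer samples.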
