Arabic Text Recognition Using a Script-Independent Methodology: A Unified HMM-Based Approach for Machine-Printed and Handwritten Text

We describe BBN’s script-independent methodology for multilingual machine-print OCR and offline handwriting recognition (HWR), based on hidden Markov models (HMMs). The feature extraction, training, and recognition components of the system are all designed to be script-independent. The HMM training and recognition components are based on BBN’s Byblos hidden Markov modeling software. The HMM parameters are estimated automatically from the training data, without the need for laborious, manually created rules. The system does not require any pre-segmentation of the data, at either the word level or the character level, so it can handle languages with cursive handwritten scripts in a straightforward manner. We demonstrate the script independence and viability of the methodology with experimental results in three scripts that exhibit significant differences in glyph characteristics: Arabic, Chinese, and English. Offline HWR of free-flowing Arabic text is a challenging task because many factors contribute to the variability in the data. In light of this book’s focus on Arabic scripts, we address some of these sources of variability and present experimental results on a large corpus of handwritten documents. We report results for specific techniques, such as the application of context-dependent HMMs to the cursive Arabic script and unsupervised adaptation to account for stylistic variations across scribes/writers. We also present an innovative integration of structural features in the HMM framework, which results in a 10% relative improvement in performance. We conclude with a new technique for dealing with noise related to the dots that are an integral yet disconnected part of many Arabic characters.
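To illustrate what recognition without pre-segmentation can look like in practice, the sketch below slides a narrow window across a binarized text-line image and emits one feature vector per frame, so that an HMM could consume the line as a plain sequence of observations. This is a minimal illustration only: the abstract does not specify the system's features, and the function name `extract_frames`, the parameters `frame_width`, `step`, and `cells`, and the ink-density feature are all assumptions made for this example.

```python
from typing import List


def extract_frames(
    image: List[List[int]],
    frame_width: int = 2,
    step: int = 1,
    cells: int = 4,
) -> List[List[float]]:
    """Turn a binarized line image (rows of 0/1 pixels, 1 = ink) into a
    left-to-right sequence of feature vectors.

    Each frame is a vertical strip `frame_width` pixels wide. The strip is
    split into `cells` horizontal bands, and the feature for each band is
    the fraction of ink pixels in it. No word or character boundaries are
    computed anywhere. (Hypothetical features, chosen for illustration.)
    """
    height = len(image)
    width = len(image[0]) if height else 0
    if height == 0 or width < frame_width:
        return []

    frames: List[List[float]] = []
    for x in range(0, width - frame_width + 1, step):
        features: List[float] = []
        for c in range(cells):
            top = c * height // cells
            bottom = (c + 1) * height // cells
            area = (bottom - top) * frame_width
            ink = sum(
                image[y][xx]
                for y in range(top, bottom)
                for xx in range(x, x + frame_width)
            )
            # Empty bands (possible when cells > height) contribute 0.0.
            features.append(ink / area if area else 0.0)
        frames.append(features)
    return frames


# Tiny 4x4 example: a blob in the upper left, a stroke in the lower right.
line_image = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 1],
]
frames = extract_frames(line_image, frame_width=2, step=1, cells=2)
```

Because the output is simply an ordered list of vectors, the same routine applies unchanged to any script; for a right-to-left script such as Arabic, the frame sequence could be reversed before modeling, which is again an assumption of this sketch rather than a statement about the system described above.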
