Localization of Digit Strings in Farsi/Arabic Document Images Using Structural Features and Syntactical Analysis

This paper presents a new method for localizing digit strings with a specific syntax in Farsi/Arabic document images. First, features are extracted from every connected component in each text line. These features are designed for Farsi/Arabic scripts and can differentiate digit from non-digit connected components. The features are then classified, and each connected component is assigned the probability of belonging to each of four classes: digit, slash, double-digit, and non-digit. Next, a discrete hidden Markov model, acting as a syntactic analyzer, localizes digit strings with the desired syntaxes. The results, presented separately for handwritten and machine-printed text lines, are very promising.
