Language independent rule based classification of printed & handwritten text

Handwriting in data entry forms and documents usually represents information filled in by the user, which should be treated differently from the printed text. In the Arab world, this filled-in information is normally in English or Arabic. Moreover, classification approaches differ considerably for machine-printed and handwritten text. Therefore, separating text into printed and handwritten entries is required prior to segmentation and classification. This research addresses the problem of language-independent text distinction in multilingual data entry forms. Our main focus is to distinguish machine-printed text from handwriting in such forms, independently of language. The proposed approach explores new statistical and structural features of text lines to classify them into separate categories. Accordingly, a set of classification rules is derived to explicitly differentiate machine-printed and handwritten entries written in any language. An additional novelty of the proposed approach is that no training or training data is required; rather, text is discriminated on the basis of simple rules. Promising experimental results with 90% accuracy show that the proposed approach is simple and robust. Finally, the scheme is independent of the languages, styles, sizes, and fonts that commonly co-exist in multilingual data entry forms.
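To make the idea of training-free, rule-based discrimination concrete, the following is a minimal sketch and not the rule set derived in the paper. The abstract does not list the actual features or thresholds, so everything here is an assumption chosen for illustration: the two features (variation in connected-component height and scatter of component bottom edges), the function names `coefficient_of_variation` and `classify_text_line`, and the threshold values `height_cv_max` and `baseline_std_max`. The sketch only shows how simple statistics of a text line can drive a fixed rule without any learned model.

```python
from statistics import mean, pstdev


def coefficient_of_variation(values):
    """Population standard deviation divided by the mean (0.0 if the mean is 0)."""
    m = mean(values)
    return pstdev(values) / m if m else 0.0


def classify_text_line(boxes, height_cv_max=0.15, baseline_std_max=2.0):
    """Label one text line as 'printed' or 'handwritten' using fixed rules.

    boxes: list of (x, y, width, height) bounding boxes of the connected
    components in the line, with y measured downward from the top.
    Both thresholds are hypothetical values, not taken from the paper.
    """
    if not boxes:
        raise ValueError("a text line needs at least one component")

    heights = [h for _, _, _, h in boxes]
    bottoms = [y + h for _, y, _, h in boxes]

    # Assumed statistical feature: relative spread of component heights.
    height_cv = coefficient_of_variation(heights)
    # Assumed structural feature: how far bottom edges stray from one baseline.
    baseline_std = pstdev(bottoms)

    # Rule: regular heights and a steady baseline suggest machine print.
    if height_cv <= height_cv_max and baseline_std <= baseline_std_max:
        return "printed"
    return "handwritten"


# Synthetic example lines (made-up coordinates, not data from the paper).
printed_line = [(0, 10, 8, 20), (10, 10, 8, 20), (20, 10, 8, 20), (30, 11, 8, 19)]
handwritten_line = [(0, 8, 9, 25), (12, 14, 7, 15), (22, 5, 11, 30), (36, 12, 8, 18)]

print(classify_text_line(printed_line))
print(classify_text_line(handwritten_line))
```

Because the rule depends only on geometric regularity and not on character shapes, a sketch of this kind makes no reference to the script being written, which is the sense in which such rules can be language independent. In practice the thresholds would need to be tuned to the scan resolution and to effects such as ascenders and descenders in printed text.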

[18]  Mahmoud R. El-Sakka,et al.  Novel Adaptive Filtering for Salt-and-Pepper Noise Removal from Binary Document Images , 2004, ICIAR.