Multilingual corpus construction based on printed and handwritten character separation

This paper proposes a method for extracting printed and handwritten characters from multilingual document images in order to build a corpus. Connected component analysis is first used to remove graphics from the document images. Multiple types of features are then combined with the AdaBoost algorithm to classify printed and handwritten characters in a more versatile and robust way. The procedure has three steps. First, the content of each image is divided into text patches, which are used to distinguish between languages. Second, classifiers are trained on the segmented patches using the multiple feature types and AdaBoost. Finally, the trained classifiers separate the printed and handwritten parts of a new image set. The proposed method improves the precision with which written material is extracted from text images in different languages. Experimental results show that it achieves higher precision and recall than state-of-the-art methods.
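The classifier training step can be illustrated with a minimal sketch of discrete AdaBoost over single-feature threshold stumps. This is not the authors' implementation: the abstract names neither the weak learner nor the features, so the stumps, the two feature names (stroke-width variance and baseline deviation), and the toy patch values below are all assumptions made purely for illustration.

```python
import math

def stump(x, feature, threshold, polarity):
    """Weak learner: vote `polarity` if x[feature] >= threshold, else -polarity."""
    return polarity if x[feature] >= threshold else -polarity

def train_adaboost(X, y, rounds=5):
    """Discrete AdaBoost. X: feature vectors per text patch, y: labels in {-1, +1}."""
    n, d = len(X), len(X[0])
    weights = [1.0 / n] * n
    model = []  # list of (alpha, feature, threshold, polarity)
    for _ in range(rounds):
        # Pick the stump with the lowest weighted error over all candidates.
        best = None
        for j in range(d):
            for thr in sorted({x[j] for x in X}):
                for pol in (1, -1):
                    err = sum(w for w, xi, yi in zip(weights, X, y)
                              if stump(xi, j, thr, pol) != yi)
                    if best is None or err < best[0]:
                        best = (err, j, thr, pol)
        err, j, thr, pol = best
        # Clamp the error so the logarithm stays finite on separable data.
        err = min(max(err, 1e-10), 1 - 1e-10)
        alpha = 0.5 * math.log((1 - err) / err)
        model.append((alpha, j, thr, pol))
        # Increase the weight of misclassified patches, then renormalise.
        weights = [w * math.exp(-alpha * yi * stump(xi, j, thr, pol))
                   for w, xi, yi in zip(weights, X, y)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return model

def predict(model, x):
    """Sign of the alpha-weighted vote: +1 = printed, -1 = handwritten."""
    score = sum(alpha * stump(x, j, thr, pol) for alpha, j, thr, pol in model)
    return 1 if score >= 0 else -1

# Made-up patch features: [stroke_width_variance, baseline_deviation].
X = [[0.10, 0.05], [0.12, 0.08], [0.08, 0.04], [0.15, 0.06],   # printed
     [0.40, 0.30], [0.35, 0.25], [0.50, 0.20], [0.45, 0.35]]   # handwritten
y = [1, 1, 1, 1, -1, -1, -1, -1]

model = train_adaboost(X, y)
print(predict(model, [0.11, 0.07]), predict(model, [0.42, 0.28]))
```

In a real pipeline the feature vectors would be computed from the segmented text patches, and one such classifier could be trained per language, but those details are not given in the abstract.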
