Suppression of non-text components in handwritten document images

Document layout analysis is a pre-processing step in converting handwritten or printed documents into electronic form through an Optical Character Recognition (OCR) system. Handwritten documents are usually unstructured, i.e. they do not follow a specific layout, and many of them contain non-text regions such as graphs, tables, and diagrams. Such documents therefore cannot be given directly as input to an OCR system without first suppressing the non-text regions. The traditional Run Length Smoothing Algorithm (RLSA) does not produce good results for handwritten document pages, since their text components have a lower pixel density than those of printed text. In the present work, a modified RLSA, called the Spiral Run Length Smearing Algorithm (SRLSA), is applied to suppress the non-text components in handwritten document images. The components in the document pages are then classified into text and non-text groups using a Support Vector Machine (SVM) classifier. The method achieves a success rate of 83.3% on a dataset of 3000 components.
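For readers unfamiliar with the baseline, the following is a minimal sketch of the traditional RLSA that the abstract contrasts with SRLSA. It is not the SRLSA proposed here, whose spiral traversal the abstract does not describe. The function names, the 0/1 list-of-lists image representation, the threshold parameters, and the choice to fill only gaps bounded by foreground pixels on both sides are all assumptions made for illustration.

```python
def rlsa_row(row, threshold):
    """Smear one binary row (1 = ink, 0 = background).

    Assumption: a background run is filled only if it lies between two
    foreground pixels and its length is at most `threshold`.
    """
    out = list(row)
    last_fg = None  # index of the most recent foreground pixel seen
    for i, value in enumerate(row):
        if value == 1:
            if last_fg is not None:
                gap = i - last_fg - 1
                if 0 < gap <= threshold:
                    for j in range(last_fg + 1, i):
                        out[j] = 1
            last_fg = i
    return out


def rlsa_horizontal(image, threshold):
    """Apply row-wise smearing to every row of the image."""
    return [rlsa_row(row, threshold) for row in image]


def rlsa_vertical(image, threshold):
    """Apply smearing along columns by transposing, smearing, transposing back."""
    transposed = [list(col) for col in zip(*image)]
    smeared = rlsa_horizontal(transposed, threshold)
    return [list(row) for row in zip(*smeared)]


def rlsa(image, h_threshold, v_threshold):
    """Combine horizontal and vertical smearing with a pixel-wise AND."""
    horizontal = rlsa_horizontal(image, h_threshold)
    vertical = rlsa_vertical(image, v_threshold)
    return [
        [a & b for a, b in zip(row_h, row_v)]
        for row_h, row_v in zip(horizontal, vertical)
    ]
```

With a fixed threshold, short gaps between characters are bridged so that neighbouring strokes merge into blocks. The sparser strokes of handwriting leave wider gaps, which is one way to picture why a threshold tuned for printed text may smear handwritten pages poorly.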
