Document-Zone Classification in Torn Documents

Torn documents commonly exhibit arbitrary orientation and sparse content. Accurate and reliable computer-based analysis of such fragments therefore requires content-zone segmentation. In our previous work, we studied the segmentation of handwritten and printed text. A questioned document piece in the form of an office note, however, may also contain non-text data such as logos, graphics, and pictures, so a finer content-zone classification is needed. In this paper we propose a two-tier approach for segmenting non-text, handwritten text, and printed text. The first tier discriminates between text and non-text regions. The second tier classifies each text zone identified in the first tier as handwritten or printed. Gabor features are used in Tier-1 and chain-code features in Tier-2. Using an SVM classifier, we correctly identified 97.65% of the 31,227 text regions in our current test data. Among the identified text regions, the proposed approach identified 98.69% of the printed text and 96.39% of the handwritten text.
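The two-tier decision flow can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`classify_zone`, `chain_code_histogram`), the callables `is_text` and `is_printed` (stand-ins for the trained Tier-1 and Tier-2 SVM decisions), and the normalized 8-direction Freeman histogram are all assumptions chosen for illustration. The abstract does not specify how the chain-code features are actually computed, and the Gabor feature extraction for Tier-1 is omitted here.

```python
from collections import Counter
from typing import Callable, List, Tuple

Point = Tuple[int, int]

# Freeman 8-direction codes, counter-clockwise from east, with y pointing up.
# This direction convention is an assumption of the sketch.
FREEMAN = {
    (1, 0): 0, (1, 1): 1, (0, 1): 2, (-1, 1): 3,
    (-1, 0): 4, (-1, -1): 5, (0, -1): 6, (1, -1): 7,
}


def chain_code(contour: List[Point]) -> List[int]:
    """Return the Freeman chain code of a closed, 8-connected contour."""
    closed = contour[1:] + contour[:1]  # pair each point with its successor
    return [FREEMAN[(x1 - x0, y1 - y0)]
            for (x0, y0), (x1, y1) in zip(contour, closed)]


def chain_code_histogram(contour: List[Point]) -> List[float]:
    """Normalized 8-bin histogram of chain-code directions (hypothetical feature)."""
    codes = chain_code(contour)
    counts = Counter(codes)
    return [counts.get(d, 0) / len(codes) for d in range(8)]


def classify_zone(zone,
                  is_text: Callable[[object], bool],
                  is_printed: Callable[[object], bool]) -> str:
    """Tier-1 separates text from non-text; Tier-2 runs only on text zones."""
    if not is_text(zone):
        return "non-text"
    return "printed" if is_printed(zone) else "handwritten"


# Usage: a unit square traversed counter-clockwise.
square = [(0, 0), (1, 0), (1, 1), (0, 1)]
square_hist = chain_code_histogram(square)
```

In this arrangement the Tier-2 classifier never sees non-text zones, which mirrors the abstract's description of classifying handwritten and printed text only within zones that Tier-1 accepted as text.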
