Text versus non-text distinction in online handwritten documents

This paper explores how well text can be distinguished from non-text in online handwritten documents using only offline information. Two systems are introduced. The first system begins by segmenting the document into zones. For this purpose, four methods originally developed for machine-printed documents are compared: x-y cut, morphological closing, Voronoi segmentation, and whitespace analysis. A state-of-the-art classifier then labels each zone as text or non-text. The second system follows a bottom-up approach that classifies connected components directly. Experiments are performed on a new dataset of online handwritten documents containing different content types in arbitrary arrangements. The best system assigns 94.3% of the pixels to the correct class.
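Of the four segmentation methods compared, x-y cut is the easiest to illustrate in a few lines. The following is a minimal sketch of the general technique, not the implementation evaluated in the paper: the function name `xy_cut`, the `min_gap` threshold, and the list-of-rows binary image representation (1 = ink) are all assumptions made for illustration. The sketch trims each region to its ink bounding box, finds the widest empty run in the horizontal and vertical projection profiles, and splits recursively at that gap until no gap of at least `min_gap` pixels remains.

```python
def xy_cut(img, min_gap=2):
    """Recursive x-y cut on a binary image given as a list of rows (1 = ink).

    Returns zones as (top, left, bottom, right), with bottom/right exclusive.
    `min_gap` (hypothetical parameter) is the smallest whitespace run, in
    pixels, that is allowed to separate two zones.
    """
    zones = []

    def widest_gap(profile):
        # Longest run of zeros in a projection profile, as (length, start).
        best_len, best_start, run_start = 0, -1, None
        for i, v in enumerate(profile + [1]):  # sentinel closes a trailing run
            if v == 0:
                if run_start is None:
                    run_start = i
            elif run_start is not None:
                if i - run_start > best_len:
                    best_len, best_start = i - run_start, run_start
                run_start = None
        return best_len, best_start

    def trim(profile, start):
        # Drop empty entries at both ends; the caller guarantees some ink.
        lo, hi = 0, len(profile)
        while profile[lo] == 0:
            lo += 1
        while profile[hi - 1] == 0:
            hi -= 1
        return start + lo, start + hi, profile[lo:hi]

    def recurse(top, left, bottom, right):
        rows = [sum(img[y][left:right]) for y in range(top, bottom)]
        if not any(rows):
            return  # region contains no ink
        cols = [sum(img[y][x] for y in range(top, bottom))
                for x in range(left, right)]
        top, bottom, rows = trim(rows, top)
        left, right, cols = trim(cols, left)
        h_len, h_start = widest_gap(rows)
        v_len, v_start = widest_gap(cols)
        if max(h_len, v_len) < min_gap:
            zones.append((top, left, bottom, right))  # cannot split further
            return
        if h_len >= v_len:
            # Horizontal band of whitespace: split into upper and lower parts.
            cut = top + h_start
            recurse(top, left, cut, right)
            recurse(cut + h_len, left, bottom, right)
        else:
            # Vertical band of whitespace: split into left and right parts.
            cut = left + v_start
            recurse(top, left, bottom, cut)
            recurse(top, cut + v_len, bottom, right)

    if img and img[0]:
        recurse(0, 0, len(img), len(img[0]))
    return zones
```

Each returned zone would then be passed to a text/non-text classifier. Because the cuts are axis-aligned, a sketch like this can only separate content that is divided by straight horizontal or vertical whitespace.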