A Zone Classification Approach for Arabic Documents using Hybrid Features

Zone segmentation and classification are important steps in document layout analysis. Segmentation decomposes a scanned document into zones, and each zone must then be classified as text or non-text so that only text zones are passed to a recognition engine. This avoids the garbage output that results from sending non-text zones to the engine. This paper proposes a framework for zone segmentation and classification. Zones are segmented using morphological operations and connected component analysis. Features are then extracted from each zone to classify it as text or non-text. The feature set is a hybrid of texture-based and connected-component-based features. Effective features are selected using a genetic algorithm, and the selected features are fed into a linear SVM classifier for zone classification. System evaluation shows that the proposed zone classification works well on multi-font and multi-size documents with a variety of layouts, including historical documents.
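The segmentation step can be illustrated with a minimal sketch: a binary dilation smears nearby ink pixels together, and connected component labeling on the smeared image yields candidate zones. The abstract does not give the structuring element, its size, or the connectivity used, so the rectangular element, the `kw` and `kh` half-widths, the 4-connectivity, and the function names `dilate` and `zones` are all assumptions made for illustration only.

```python
from collections import deque


def dilate(img, kw, kh):
    """Binary dilation with a (2*kh+1) x (2*kw+1) rectangle (assumed shape)."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if img[y][x]:
                # Switch on every pixel covered by the rectangle centred here.
                for yy in range(max(0, y - kh), min(h, y + kh + 1)):
                    for xx in range(max(0, x - kw), min(w, x + kw + 1)):
                        out[yy][xx] = 1
    return out


def zones(img, kw=2, kh=1):
    """Return (top, left, bottom, right) boxes of the smeared components."""
    smeared = dilate(img, kw, kh)
    h, w = len(img), len(img[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for y in range(h):
        for x in range(w):
            if not smeared[y][x] or seen[y][x]:
                continue
            # Breadth-first flood fill over 4-connected neighbours.
            seen[y][x] = True
            queue = deque([(y, x)])
            top, left, bottom, right = y, x, y, x
            while queue:
                cy, cx = queue.popleft()
                top, bottom = min(top, cy), max(bottom, cy)
                left, right = min(left, cx), max(right, cx)
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = cy + dy, cx + dx
                    if (0 <= ny < h and 0 <= nx < w
                            and smeared[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


# Toy page: two groups of isolated ink pixels on different rows.
page = [[0] * 20 for _ in range(8)]
for x in (1, 3, 5):
    page[1][x] = 1
for x in (12, 14):
    page[5][x] = 1

print(zones(page, kw=1, kh=0))
```

On the toy page, a horizontal dilation of one pixel merges each group of isolated pixels into a single zone, whereas no dilation leaves five separate components. Each resulting box would then be the unit from which the hybrid features are extracted; feature extraction, genetic feature selection, and the SVM are not shown here.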
