An adaptive text-line extraction algorithm for printed Arabic documents with diacritics

The performance of document text recognition depends on text-line segmentation algorithms, which in turn rely heavily on the language, the author’s writing style, pen type, and document quality. In this paper, we present a novel unsupervised text-line segmentation algorithm for printed Arabic documents with and without diacritics. The presented approach iteratively combines a projection profile with connected components to detect text lines. The primary benefits of the presented algorithm are that (i) it is not threshold dependent, (ii) it requires no training phase for threshold selection, and (iii) it is robust to page rotation and to variation in font type, size, and style, for documents both with and without diacritics. Extensive computational simulations on a manually collected dataset demonstrate the efficiency of the proposed scheme compared with several baseline and state-of-the-art methods, including the Voronoi, X-Y Cut, Docstrum, Smearing, and Seam-carving methods. A computational time analysis is also presented.
