Kannada text line extraction based on energy minimization and skew correction

There are many governmental, cultural, commercial and educational organizations that manage large number of manuscript textual information. Kannada being one of the official languages of South India, such organizations include Kannada handwritten documents. Text line segmentation in such documents remains an open document analysis problem. Detection and correction of skew angle of the segmented text lines become another important step in document analysis. Most of the segmentation algorithms, for skewed text lines, present in the literature today are sensitive to the degree of skew, direction of skew, and spacing between adjacent lines. In this paper, proposed method for the text line extraction and skew correction of the extracted text lines uses a new cost function, which considers the spacing between text lines and the skew of each text line is used. Precisely, the problem is formulated as an energy minimization problem so that the minimization of the cost function yields a set of text lines. Further it is required to efficiently correct baseline skew and fluctuations of these text lines. This proposed method also uses an efficient algorithm for baseline correction. It consists of normalizing the lower baseline to a horizontal line using a skating window approaches, thus, avoiding the segmentation of text lines into subparts. This approach copes with baselines which are skewed, fluctuating, or both. It differs from machine learning approaches which need manual pixel assignments to baselines. Experimental results show that this baseline correction approach highly improves performance.

[1]  Nam Ik Cho,et al.  State Estimation in a Document Image and Its Application in Text Block Identification and Text Line Extraction , 2010, ECCV.

[2]  Gerlind Plonka-Hoch,et al.  The Curvelet Transform , 2010, IEEE Signal Processing Magazine.

[3]  Venu Govindaraju,et al.  Skew detection for complex document images using fuzzy runlength , 2003, Seventh International Conference on Document Analysis and Recognition, 2003. Proceedings..

[4]  Horst Bunke,et al.  Handwritten sentence recognition , 2000, Proceedings 15th International Conference on Pattern Recognition. ICPR-2000.

[5]  Samy Bengio,et al.  Offline recognition of unconstrained handwritten texts using HMMs and statistical language models , 2004, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[6]  Michael Blumenstein,et al.  New Preprocessing Techniques for Handwritten Word Recognition , 2002 .

[7]  Marcus Liwicki,et al.  A novel approach to on-line handwriting recognition based on bidirectional long short-term memory networks , 2007 .

[8]  Nam Ik Cho,et al.  Text-Line Extraction in Handwritten Chinese Documents Based on an Energy Minimization Framework , 2012, IEEE Transactions on Image Processing.

[9]  Juergen Luettin,et al.  A new normalization technique for cursive handwritten words , 2001, Pattern Recognit. Lett..

[10]  Laurence Likforman-Sulem,et al.  Text line segmentation of historical documents: a survey , 2007, International Journal of Document Analysis and Recognition (IJDAR).

[11]  Laurence Likforman-Sulem,et al.  New baseline correction algorithm for text-line recognition with bidirectional recurrent neural networks , 2013, J. Electronic Imaging.

[12]  Lawrence O'Gorman,et al.  The Document Spectrum for Page Layout Analysis , 1993, IEEE Trans. Pattern Anal. Mach. Intell..

[13]  Sargur N. Srihari,et al.  Off-Line Cursive Script Word Recognition , 1989, IEEE Trans. Pattern Anal. Mach. Intell..

[14]  Salvador España Boquera,et al.  Improving Offline Handwritten Text Recognition with Hybrid HMM/ANN Models , 2011, IEEE Transactions on Pattern Analysis and Machine Intelligence.