Character Segmentation Scheme for OCR System: For Myanmar Printed Documents

Automatic optical character recognition (OCR) systems for machine-printed text are highly desirable for a multitude of modern IT applications, including digital library software. However, state-of-the-art OCR systems cannot handle Myanmar script, because the language poses many challenges for document understanding. Therefore, the authors design an Optical Character Recognition System for Myanmar Printed Document (OCRMPD), with several proposed techniques that can automatically recognize Myanmar printed text from document images. To make the system more accurate, the authors propose a method for isolating character images that uses not only projection methods but also structural analysis for wrongly segmented characters. To demonstrate the effectiveness of the segmentation technique, the authors apply a new hybrid feature extraction method and choose an SVM classifier for recognition of the character images. The proposed algorithms have been tested on a variety of Myanmar printed documents, and the experimental results indicate that the methods can increase segmentation accuracy as well as recognition rates.

DOI: 10.4018/ijcvip.2011100104
International Journal of Computer Vision and Image Processing, 1(4), 50-58, October-December 2011. Copyright © 2012, IGI Global. Copying or distributing in print or electronic forms without written permission of IGI Global is prohibited.

Chinese, etc. Because automatic OCR systems for machine-printed text are highly desirable for a multitude of modern IT applications, including digital library software, efficient OCR systems for Myanmar text are a present-day requirement. We therefore concentrate on printed characters, since handwritten characters are used less and less and, with computerization everywhere, are found only in signatures. Segmentation is an important phase of an OCR system, and the accuracy of any OCR system depends heavily on it.
Character segmentation is an operation that seeks to decompose an image of a sequence of characters into sub-images of individual symbols. It is one of the decision processes in a system for optical character recognition (OCR). The demand for greater than 99% accuracy in printed OCR means that the error budget for segmentation must be very small, which is a significant challenge for complex scripts such as those in the Brahmi family. Even in good-quality documents, some adjacent characters touch each other because of inappropriate scanning resolution. While character segmentation is well researched for several scripts, and very good solutions exist for them, for many more scripts the segmentation error rate is high enough to make OCR impractical to use. South-East Asian scripts, which are syllabic scripts whose syllables are in turn complex combinations of one or more characters, require different procedures for character alignment and segmentation (Casey & Lecolinet, 1996; Hasnat & Khan, 2009; Agrawal & Doermann, 2008; Kumar & Sengar, 2010). Therefore, in this paper, the Optical Character Recognition System for Myanmar Printed Document (OCRMPD) is proposed for Myanmar script, and the segmentation of overlapped characters is addressed. The proposed algorithm is based on projection profiles and connected component analysis, adapted to the nature and structure of the script.

The rest of the paper is organized as follows. Section 2 introduces the nature of Myanmar script. Section 3 discusses previous work and background theory. Section 4 gives more detail on our implementation of the recognition system. Section 5 discusses the experimental results, and Section 6 concludes.

2. NATURE OF MYANMAR SCRIPT

In Myanmar script, there is no distinction between upper-case and lower-case characters. Text is written horizontally from left to right.
The character set consists of 35 consonants (including ‘ ’ and ‘ ’), 8 vowel signs, 7 independent vowels, 5 combining marks, 6 symbols and punctuation marks, and 10 digits. Each word is formed by combining consonants, vowels and various signs, and the script has its own composition rules for combining vowels, consonants and modifiers. There are more than 1881 glyphs in total, and many of them look similar (e.g., , and so on). In written text, a space is placed after each phrase rather than after each word or syllable. Myanmar characters are built from circular shapes, straight lines (horizontal, vertical or slanted) and dots (Hussain, Durrani, & Gul, 2005; Maw, 2001; Alexander, 2003). From the segmentation point of view, a glyph can be formed from as many as 8 components, but the number of connected components in a glyph ranges from 1 to 4 and cannot be greater than 4. Myanmar script is written in three zones: upper, middle and lower. A sample Myanmar glyph is shown in Figure 1.

3. THEORY AND RELATED WORK

The methods for character segmentation can be roughly classified into three categories: the straight segmentation method, the recognition-based segmentation method, and the cut classification method (Lee, Lee, & Park, 1996). In the first category, each word is segmented into several characters, and character recognition techniques are applied to each segment.
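As an illustration of straight segmentation, the following is a minimal sketch of splitting a text-line image at empty columns of its vertical projection profile. It assumes the line has already been binarized (1 = ink, 0 = background); the function names and the toy image are hypothetical and are not taken from the OCRMPD implementation.

```python
from typing import List, Tuple

Image = List[List[int]]  # binary image: 1 = ink, 0 = background


def vertical_projection(image: Image) -> List[int]:
    """Count the ink pixels in every column of a binary image."""
    if not image:
        return []
    width = len(image[0])
    return [sum(row[x] for row in image) for x in range(width)]


def split_on_gaps(profile: List[int]) -> List[Tuple[int, int]]:
    """Return half-open (start, end) column spans of consecutive non-empty columns."""
    spans: List[Tuple[int, int]] = []
    start = None
    for x, count in enumerate(profile):
        if count > 0 and start is None:
            start = x  # a run of ink columns begins
        elif count == 0 and start is not None:
            spans.append((start, x))  # an empty column ends the run
            start = None
    if start is not None:
        spans.append((start, len(profile)))  # run reaches the right edge
    return spans


def segment_characters(image: Image) -> List[Tuple[int, int]]:
    """Segment a text-line image wherever the column profile drops to zero."""
    return split_on_gaps(vertical_projection(image))


# Toy 3x8 line image (an assumption for illustration only).
line_image = [
    [1, 1, 0, 0, 1, 0, 1, 1],
    [1, 0, 0, 0, 1, 0, 0, 1],
    [1, 1, 0, 0, 1, 0, 1, 1],
]
print(segment_characters(line_image))  # [(0, 2), (4, 5), (6, 8)]
```

A split of this kind relies on at least one empty column between neighbouring characters, so characters that touch each other are returned as a single span.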
In spite of the simplicity in implementing this method, its limit comes from
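The structural property stated in Section 2, that a Myanmar glyph has between 1 and 4 connected components, suggests one possible check for wrongly segmented characters. The sketch below counts 8-connected components in a binary segment and flags segments outside that range. Using the bound as a rejection test, the choice of 8-connectivity, and all names are assumptions made for illustration; the source does not confirm that OCRMPD works this way.

```python
from collections import deque
from typing import List

Image = List[List[int]]  # binary image: 1 = ink, 0 = background

MAX_COMPONENTS_PER_GLYPH = 4  # upper bound stated in Section 2


def count_connected_components(image: Image) -> int:
    """Count 8-connected groups of ink pixels using breadth-first search."""
    height = len(image)
    width = len(image[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    components = 0
    for y in range(height):
        for x in range(width):
            if image[y][x] != 1 or seen[y][x]:
                continue
            components += 1  # unvisited ink pixel starts a new component
            seen[y][x] = True
            queue = deque([(y, x)])
            while queue:
                cy, cx = queue.popleft()
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and image[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
    return components


def looks_wrongly_segmented(segment: Image) -> bool:
    """Flag a segment whose component count falls outside the range 1 to 4."""
    n = count_connected_components(segment)
    return n < 1 or n > MAX_COMPONENTS_PER_GLYPH


# Toy segments (assumptions for illustration only).
glyph = [
    [1, 1, 0, 1],
    [0, 1, 0, 1],
    [0, 0, 0, 0],
    [1, 0, 0, 0],
]
crowded = [[1, 0, 1, 0, 1, 0, 1, 0, 1]]

print(count_connected_components(glyph), looks_wrongly_segmented(glyph))
print(count_connected_components(crowded), looks_wrongly_segmented(crowded))
```

A flagged segment would still need further analysis to decide where to cut it; the check only identifies candidates.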

[1] Mumit Khan, et al. Rule based segmentation of lower modifiers in complex Bangla scripts, 2009.

[2] Muhammad Abuzar Fahiem, et al. Segmentation of Printed Urdu Scripts Using Structural Features, 2009, 2009 Second International Conference in Visualisation.

[3] Jian-xiong Dong, et al. An improved handwritten Chinese character recognition system using support vector machine, 2005, Pattern Recognit. Lett.

[4] J. Mantas, et al. An overview of character recognition methodologies, 1986, Pattern Recognit.

[5] Seong-Whan Lee, et al. A New Methodology for Gray-Scale Character Segmentation and Recognition, 1996, IEEE Trans. Pattern Anal. Mach. Intell.

[6] Eric Lecolinet, et al. A Survey of Methods and Strategies in Character Segmentation, 1996, IEEE Trans. Pattern Anal. Mach. Intell.

[7] Srikanta Pal, et al. Line and Word Segmentation Approach for Printed Documents, 2010.

[8] Eduardo Romero, et al. Biomedical Image Analysis and Machine Learning Technologies: Applications and Techniques, 2009.

[9] Rajendra Kumar Sharma, et al. Segmentation of touching characters in upper zone in printed Gurmukhi script, 2009, COMPUTE '09.

[10] David S. Doermann, et al. Re-targetable OCR with Intelligent Character Segmentation, 2008, 2008 The Eighth IAPR International Workshop on Document Analysis Systems.

[11] Jose Garcia-Rodriguez, et al. Robotic Vision: Technologies for Machine Learning and Vision Applications, 2013.

[12] Aly A. Farag, et al. Assessment of Kidney Function Using Dynamic Contrast Enhanced MRI Techniques, 2010.

[13] Driss Aboutajdine, et al. A New Image Distortion Measure Based on Natural Scene Statistics Modeling, 2012, Int. J. Comput. Vis. Image Process.

[14] Sagarmay Deb. Multimedia Systems and Content-Based Image Retrieval, 2003.

[15] Roseli A. F. Romero, et al. Computer Vision for Learning to Interact Socially with Humans, 2013.

[16] Yoram Singer, et al. Reducing Multiclass to Binary: A Unifying Approach for Margin Classifiers, 2000, J. Mach. Learn. Res.

[17] Mohamed Fakir, et al. Recognition of Color Objects Using Hybrids Descriptors, 2013, Int. J. Comput. Vis. Image Process.

[18] Mandeep Kaur, et al. OCR for Telugu Script Using Back-Propagation Based Classifier, 2010.

[19] Mohamed Fakir, et al. Tifinaghe Document Converter, 2013, Int. J. Comput. Vis. Image Process.

[20] V. Vijay Kumar, et al. Segmentation of Printed Text in Devanagari Script and Gurmukhi Script, 2010.

[21] V. K. Govindan, et al. Character recognition - A review, 1990, Pattern Recognit.

[22] Umapada Pal, et al. Multi-Oriented and Multi-Sized Touching Character Segmentation Using Dynamic Programming, 2009, 2009 10th International Conference on Document Analysis and Recognition.

[23] Zubair A. Shaikh, et al. Character Segmentation of Sindhi, an Arabic Style Scripting Language, using Height Profile Vector, 2009.

[24] Syed M. Naqvi, et al. A Hybrid Lossless-Lossy Binary Image Compression Scheme, 2013, Int. J. Comput. Vis. Image Process.