A Structural Analysis Based Feature Extraction Method for OCR System For Myanmar Printed Document Images

This paper proposes a new feature extraction method for off-line recognition of Myanmar printed documents. One of the most important factors in achieving high recognition performance in an Optical Character Recognition (OCR) system is the selection of the feature extraction method. Existing OCR systems use various feature extraction methods because of the diverse nature of the scripts. One major contribution of this work is the design of logically rigorous coding-based features. To show the effectiveness of the proposed method, the paper assumes that the documents have been successfully segmented into characters and extracts features from these isolated Myanmar characters. The features are extracted using structural analysis of the Myanmar script. Experiments are carried out using a Support Vector Machine (SVM) classifier, and the results are compared with the previously proposed feature extraction method.

DOI: 10.4018/ijcvip.2012010102. International Journal of Computer Vision and Image Processing, 2(1), 16-41, January-March 2012. Copyright © 2012, IGI Global. Copying or distributing in print or electronic forms without written permission of IGI Global is prohibited.

small if a statistical classifier is to be used (curse of dimensionality) (Duda, Hart, & Stork, 2001). A feature extraction method that proves successful in one application domain may turn out not to be very useful in another. Furthermore, the type of features extracted must match the requirements of the chosen classifier. Different OCR systems use both many standard methods and various newly devised techniques for feature extraction. All of these methods fall into two types of features: statistical features, derived from the statistical distribution of points, and structural features. The most common statistical features used for character representation are zoning, projections, and crossings and distances.
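The statistical features named above can be illustrated with a minimal sketch. It assumes a character is given as a binary grid (1 for ink, 0 for background); the function names, the grid size, and the sample glyph are illustrative choices, not taken from the paper.

```python
from typing import List, Tuple

Image = List[List[int]]  # binary image: 1 = ink, 0 = background (assumed encoding)


def zoning_features(img: Image, rows: int, cols: int) -> List[float]:
    """Ink density in each cell of a rows x cols grid laid over the image."""
    h, w = len(img), len(img[0])
    feats = []
    for r in range(rows):
        y0, y1 = r * h // rows, (r + 1) * h // rows
        for c in range(cols):
            x0, x1 = c * w // cols, (c + 1) * w // cols
            area = (y1 - y0) * (x1 - x0)
            ink = sum(img[y][x] for y in range(y0, y1) for x in range(x0, x1))
            feats.append(ink / area if area else 0.0)
    return feats


def projection_profiles(img: Image) -> Tuple[List[int], List[int]]:
    """Ink counts per row (horizontal profile) and per column (vertical profile)."""
    horizontal = [sum(row) for row in img]
    vertical = [sum(col) for col in zip(*img)]
    return horizontal, vertical


def crossings(line: List[int]) -> int:
    """Number of background-to-ink transitions along one scan line."""
    count, prev = 0, 0
    for px in line:
        if px == 1 and prev == 0:
            count += 1
        prev = px
    return count


# Hypothetical 4x4 ring-shaped glyph used only for illustration.
glyph = [
    [1, 1, 1, 1],
    [1, 0, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
]

print(zoning_features(glyph, 2, 2))
print(projection_profiles(glyph))
print(crossings(glyph[1]))
```

Concatenating such values gives a fixed-length vector per character. The number of zones controls the dimensionality, which matters for the statistical classifiers mentioned above.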
Structural features are based on topological and geometrical properties of the character (Vamvakas, Gatos, & Perantonis, 2009). Furthermore, global transformation techniques such as the Gabor and Hough transforms are becoming popular in some recognition systems. The different types of feature extraction methods of the previous century are surveyed in Trier, Jain, and Taxt (1996), who illustrate that the best method for one application domain cannot be the best for all applications. The successful use of statistical features is described in Rajashekararadhya and Ranjan (2005) and Kumar (2010); the former used 50 features and the latter a considerably large number of features. For printed Gurmukhi script (Jindal, Sharma, & Sharma, 2008) and historical documents (Vamvakas, Gatos, & Perantonis, 2009), structural features are used depending on the nature of the scripts and by subdividing the character image. Both works also report their best results. Features from Gabor filters are used mostly for script identification in bilingual documents, and Ramanathan, Ponmathavan, Thaneshwaran, Nair, Valliappan, and Soman (2009), Borji and Hamidi (2007), and Ramanathan, Nair, Thaneshwaran, Ponmathavan, Valliappan, and Soman (2009) also selected them for their OCR systems. In the previous literature on Myanmar script, the OCR system for printed documents of Swe and Tin (2006) used all the pixel values as features. The Myanmar Intelligent Character Recognition (MICR) group (Thein & Yee, 2010) used a combination of statistical and common structural features, such as the number of pixels, stroke counts, number of loops, and open direction, and showed good accuracy only for the basic alphabet, not yet for all compound words. A rule-based feature extraction method is used in Than, Aung, Yi, and Win (2006), who demonstrated it on only five handwritten alphabet characters.
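As a concrete instance of a topological feature of the kind listed above (the number of loops), the following sketch counts enclosed background regions in a binary character image. It is an illustration under stated assumptions, not the method of any cited work: the image encoding (1 for ink, 0 for background), the use of 4-connectivity for the background, and the name `count_loops` are all choices made here.

```python
from collections import deque
from typing import List


def count_loops(img: List[List[int]]) -> int:
    """Count background regions fully enclosed by ink (holes) in a binary image."""
    h, w = len(img), len(img[0])
    # Pad with a one-pixel background border so the outer background
    # forms a single connected region reachable from (0, 0).
    padded = [[0] * (w + 2)] + [[0] + list(row) + [0] for row in img] + [[0] * (w + 2)]
    H, W = h + 2, w + 2
    seen = [[False] * W for _ in range(H)]

    def flood(sy: int, sx: int) -> None:
        """Mark every background pixel 4-connected to (sy, sx)."""
        queue = deque([(sy, sx)])
        seen[sy][sx] = True
        while queue:
            y, x = queue.popleft()
            for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                ny, nx = y + dy, x + dx
                if 0 <= ny < H and 0 <= nx < W and not seen[ny][nx] and padded[ny][nx] == 0:
                    seen[ny][nx] = True
                    queue.append((ny, nx))

    flood(0, 0)  # the outer background is not a loop
    loops = 0
    for y in range(H):
        for x in range(W):
            if padded[y][x] == 0 and not seen[y][x]:
                loops += 1  # a background region not connected to the outside
                flood(y, x)
    return loops


# Hypothetical shapes: a closed ring has one loop, an open "C" shape has none.
ring = [[1, 1, 1, 1], [1, 0, 0, 1], [1, 0, 0, 1], [1, 1, 1, 1]]
open_c = [[1, 1, 1], [1, 0, 0], [1, 1, 1]]
print(count_loops(ring), count_loops(open_c))
```

A feature like this is insensitive to font size, which is one reason structural features are attractive when the printed fonts vary. It is, however, sensitive to broken strokes in degraded images, since a single gap opens the loop.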
To the best of our knowledge, extensive research on OCR systems for Myanmar script still needs to be done. The reason may be the large number of characters in Myanmar script and the complex nature of the script. In Win, Khine, and Tun (2011b), a zoning and projection profile based feature extraction method for printed documents is proposed, but this method correctly recognizes only data resembling the training data, such as the trained font types and sizes. In machine-printed documents, however, the shape discrepancy among characters belonging to the same class is sometimes quite large because of degradation of the document images. Therefore, it is necessary to select features that can adapt to shape variations due to touching noise blobs. In fact, the main problem in an OCR system is the large variation in shapes within a class of characters (Borji & Hamidi, 2007). Moreover, a large amount of training data can hinder high performance of the OCR system. Hence, for South-East Asian scripts, including Myanmar, one-stage discrimination does not generally suffice, and two-stage classification (coarse and fine) should be used. The aim of coarse classification is to cluster similar-looking characters into groups; fine classification is then performed to extract the right class (Agrawal & Doermann, 2008; Kato, Suzuki, Omachi, Aso, & Nemoto, 1999). Therefore, in this paper, the characters are logically clustered, and a new feature extraction method is proposed that uses structural features based on the writing style of the Myanmar script in order to improve the accuracy of the OCR system.
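The coarse-then-fine idea described above can be sketched as follows. This is only an illustration of the two-stage structure: it uses a nearest-mean rule at both stages as a simple stand-in, not the SVM classifier or the logical clustering used in the paper, and the group names, class labels, and feature vectors are hypothetical.

```python
from typing import Dict, List, Tuple

Vector = List[float]
Sample = Tuple[str, str, Vector]  # (coarse group, class label, feature vector)


def sq_dist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(vectors: List[Vector]) -> Vector:
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def train(samples: List[Sample]):
    """Compute one mean per coarse group and one mean per class inside each group."""
    by_group: Dict[str, Dict[str, List[Vector]]] = {}
    for group, label, vec in samples:
        by_group.setdefault(group, {}).setdefault(label, []).append(vec)
    class_means = {
        g: {lbl: mean(vs) for lbl, vs in classes.items()}
        for g, classes in by_group.items()
    }
    group_means = {
        g: mean([v for vs in classes.values() for v in vs])
        for g, classes in by_group.items()
    }
    return group_means, class_means


def classify(vec: Vector, group_means, class_means) -> Tuple[str, str]:
    """Coarse stage picks the nearest group; fine stage picks the nearest class in it."""
    group = min(group_means, key=lambda g: sq_dist(vec, group_means[g]))
    label = min(class_means[group], key=lambda l: sq_dist(vec, class_means[group][l]))
    return group, label


# Hypothetical training data: two coarse groups with two classes each.
samples = [
    ("round", "A", [0.9, 0.1]),
    ("round", "A", [0.8, 0.2]),
    ("round", "B", [0.7, 0.4]),
    ("tall", "C", [0.1, 0.9]),
    ("tall", "D", [0.3, 0.8]),
]
group_means, class_means = train(samples)
print(classify([0.72, 0.38], group_means, class_means))
```

The fine stage only compares against classes inside the selected group, so the number of candidates per decision stays small even when the script has many characters. The cost is that an error in the coarse stage cannot be corrected later.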
One major contribution of the work presented in this paper is the design of mathematically

[1] Sanjoy Pratihar, et al. On Applying the Farey Sequence for Shape Representation in Z2, 2012.

[2] R. Ramanathan, et al. Tamil Font Recognition Using Gabor Filters and Support Vector Machines, 2009, 2009 International Conference on Advances in Computing, Control, and Telecommunication Technologies.

[3] Satish Kumar, et al. Neighborhood Pixels Weights-A New Feature Extractor, 2009.

[4] Francisco Rovira-Más. Stereoscopic Vision for Off-Road Intelligent Vehicles, 2014.

[5] Chih-Jen Lin, et al. A Practical Guide to Support Vector Classification, 2008.

[6] Rajendra Kumar Sharma, et al. Structural Features for Recognizing Degraded Printed Gurmukhi Script, 2008, Fifth International Conference on Information Technology: New Generations (itng 2008).

[7] Uma Shanker Tiwary, et al. Speech, Image and Language Processing for Human Computer Interaction: Multi-Modal Advancements, 2012.

[8] Khaled S. Ahmed, et al. Estimating Protein Functions Correlation Based on Overlapping Proteins and Cluster Interactions, 2012.

[9] Phyo Thu Thu Khine, et al. Converting Myanmar printed document image into machine understandable text format, 2011, 2011 Sixth International Conference on Digital Information Management.

[10] Anil K. Jain, et al. Feature extraction methods for character recognition-A survey, 1996, Pattern Recognit.

[11] Stavros J. Perantonis, et al. A Novel Feature Extraction and Classification Methodology for the Recognition of Historical Documents, 2009, 2009 10th International Conference on Document Analysis and Recognition.

[12] Ghalem Belalem, et al. A Reflexion on Implementation Version for Active Appearance Model, 2013, Int. J. Comput. Vis. Image Process.

[13] Jian-xiong Dong, et al. An improved handwritten Chinese character recognition system using support vector machine, 2005, Pattern Recognit. Lett.

[14] Yadana Thein, et al. High Accuracy Myanmar Handwritten Character Recognition using Hybrid approach through MICR and Neural Network, 2010.

[15] Phyo Thu Thu Khine, et al. Character Segmentation Scheme for OCR System: For Myanmar Printed Documents, 2011, Int. J. Comput. Vis. Image Process.

[16] Tae-Sun Choi, et al. Depth Map and 3D Imaging Applications: Algorithms and Technologies, 2011.

[17] David S. Doermann, et al. Re-targetable OCR with Intelligent Character Segmentation, 2008, 2008 The Eighth IAPR International Workshop on Document Analysis Systems.

[18] T. Swe, et al. Recognition and Translation of the Myanmar Printed Text Based on Hopfield Neural Network, 2005, 6th Asia-Pacific Symposium on Information and Telecommunication Technologies.

[19] Nei Kato, et al. A Handwritten Character Recognition System Using Directional Element Feature and Asymmetric Mahalanobis Distance, 1999, IEEE Trans. Pattern Anal. Mach. Intell.

[20] R. Ramanathan, et al. Robust Feature Extraction Technique for Optical Character Recognition, 2009, 2009 International Conference on Advances in Computing, Control, and Telecommunication Technologies.