Global Approach for Script Identification using Wavelet Packet Based Features

In a multi-script environment, document archives commonly contain text regions printed in different scripts. For automatic processing of such documents through Optical Character Recognition (OCR), it is necessary to identify the different script regions of each document. In this paper, a novel texture-based approach is presented to identify the script type of a collection of documents printed in seven scripts, so that they can be categorized for further processing. The South Indian documents considered here are printed in seven scripts: Kannada, Tamil, Telugu, Malayalam, Urdu, Hindi and English. The document images are decomposed by Wavelet Packet Decomposition using the Haar basis function up to level two. A gray-level co-occurrence matrix is constructed for the coefficient sub-bands of the wavelet transform. The Haralick texture features are extracted from the co-occurrence matrix and then used to identify the script of a machine-printed document. The experiments used 2100 text images for training and 1400 text images for testing. Script classification performance is analyzed using the K-nearest neighbor classifier. The average success rate is found to be 99.68%.
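The pipeline described in the abstract (level-two Haar wavelet packet decomposition, co-occurrence matrices on the sub-bands, Haralick features, K-nearest neighbor classification) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The abstract does not state the quantization depth, the co-occurrence offset, which Haralick features were used, or the value of K, so the 8 gray levels, the one-pixel horizontal offset, the four features (energy, contrast, homogeneity, entropy) and k=3 below are all assumptions, and the function names are hypothetical.

```python
import numpy as np

def haar_step(x):
    """One level of 2-D orthonormal Haar analysis; returns four sub-bands."""
    x = x[: x.shape[0] // 2 * 2, : x.shape[1] // 2 * 2].astype(float)
    # Filter along rows: pairwise sums (low-pass) and differences (high-pass).
    lo = (x[:, 0::2] + x[:, 1::2]) / np.sqrt(2)
    hi = (x[:, 0::2] - x[:, 1::2]) / np.sqrt(2)
    bands = []
    for r in (lo, hi):
        # Filter along columns in the same way.
        bands.append((r[0::2, :] + r[1::2, :]) / np.sqrt(2))
        bands.append((r[0::2, :] - r[1::2, :]) / np.sqrt(2))
    return bands

def wavelet_packet(img, levels=2):
    """Full packet tree: every sub-band is split again at each level."""
    bands = [np.asarray(img, dtype=float)]
    for _ in range(levels):
        bands = [child for b in bands for child in haar_step(b)]
    return bands  # 4**levels sub-bands

def glcm(band, n_levels=8, dx=1, dy=0):
    """Normalized, symmetric co-occurrence matrix of a quantized sub-band.

    n_levels, dx and dy are assumed values, not taken from the paper.
    """
    lo, hi = band.min(), band.max()
    if hi == lo:
        q = np.zeros(band.shape, dtype=int)
    else:
        q = np.minimum(((band - lo) / (hi - lo) * n_levels).astype(int),
                       n_levels - 1)
    a = q[: q.shape[0] - dy, : q.shape[1] - dx]  # reference pixels
    b = q[dy:, dx:]                              # neighbors at offset (dy, dx)
    m = np.zeros((n_levels, n_levels))
    np.add.at(m, (a.ravel(), b.ravel()), 1)      # count co-occurring pairs
    m = m + m.T                                  # make the matrix symmetric
    return m / m.sum()

def haralick_subset(p):
    """Four of the Haralick features (an assumed subset)."""
    i, j = np.indices(p.shape)
    nz = p[p > 0]
    energy = np.sum(p ** 2)
    contrast = np.sum((i - j) ** 2 * p)
    homogeneity = np.sum(p / (1.0 + (i - j) ** 2))
    entropy = -np.sum(nz * np.log(nz))
    return [energy, contrast, homogeneity, entropy]

def script_features(img, levels=2):
    """Concatenate the texture features of every packet sub-band."""
    feats = []
    for band in wavelet_packet(img, levels):
        feats.extend(haralick_subset(glcm(band)))
    return np.array(feats)

def knn_predict(train_x, train_y, x, k=3):
    """Majority vote among the k nearest training vectors (Euclidean)."""
    d = np.linalg.norm(np.asarray(train_x, dtype=float) - x, axis=1)
    nearest = np.asarray(train_y)[np.argsort(d)[:k]]
    labels, counts = np.unique(nearest, return_counts=True)
    return labels[np.argmax(counts)]
```

With these choices, a grayscale text block passed to `script_features` yields 16 sub-bands times 4 features, that is a 64-dimensional vector; `knn_predict` then assigns the script label of the majority of the nearest training vectors. In practice the features would probably need to be normalized before the distance computation, since contrast and energy live on very different scales.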
