Paper information - Neural network based system for script identification in Indian documents

This paper describes a neural network-based script identification system for the machine reading of documents written in the English, Hindi, and Kannada scripts. Script identification is a basic requirement for automating document processing in multi-script, multi-lingual environments. The system comprises a feature extractor and a modular neural network. The feature extractor has two stages. In the first stage, the document image is dilated using 3 × 3 masks in the horizontal, vertical, right-diagonal, and left-diagonal directions. In the second stage, the average pixel distribution is computed for each of the resulting images. The modular network combines separately trained feedforward neural network classifiers, one for each script. This system recognizes 64 × 64 pixel document images. At the next level, the system is modified to operate on single-word document images in the same three scripts. The modified system includes a pre-processor, a modified feature extractor, and a probabilistic neural network classifier. The pre-processor segments the multi-script, multi-lingual document into individual words. The feature extractor accepts these word images of variable size and still produces the discriminative features used by the probabilistic neural classifier. Experiments are conducted on a manually developed database of 64 × 64 pixel document images and on a database of individual words in the three scripts. The results are encouraging and support the effectiveness of the approach.
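
The two-stage feature extractor described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the exact 3 × 3 masks or define "average pixel distribution" precisely, so the masks in `MASKS`, the helper names `dilate` and `extract_features`, and the choice of the foreground-pixel fraction as the feature are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

# Assumed 3 x 3 directional masks (hypothetical; the paper's masks are not listed here).
MASKS = {
    "horizontal": np.array([[0, 0, 0], [1, 1, 1], [0, 0, 0]], dtype=bool),
    "vertical": np.array([[0, 1, 0], [0, 1, 0], [0, 1, 0]], dtype=bool),
    "right_diagonal": np.array([[0, 0, 1], [0, 1, 0], [1, 0, 0]], dtype=bool),
    "left_diagonal": np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1]], dtype=bool),
}


def dilate(image, mask):
    """Binary dilation of a 2-D boolean image by a 3 x 3 mask, with background padding."""
    h, w = image.shape
    padded = np.pad(image, 1, mode="constant", constant_values=False)
    out = np.zeros((h, w), dtype=bool)
    for dr in range(3):
        for dc in range(3):
            if mask[dr, dc]:
                # OR in the image shifted by the offset of this mask element.
                out |= padded[dr:dr + h, dc:dc + w]
    return out


def extract_features(image):
    """Stage 1: dilate in four directions. Stage 2: mean foreground fraction per result.

    Using the mean of each dilated image is an assumed reading of
    "average pixel distribution".
    """
    image = np.asarray(image, dtype=bool)
    return np.array([dilate(image, mask).mean() for mask in MASKS.values()])


# Usage: a synthetic 64 x 64 binary image containing one horizontal stroke.
sample = np.zeros((64, 64), dtype=bool)
sample[32, 10:50] = True
features = extract_features(sample)  # one value per direction, in MASKS order
```

Because the feature is a fraction of the image area, the same function accepts word images of any size, which matches the variable-size requirement mentioned for the modified system. The resulting four-element vector would then be passed to a classifier, which is not sketched here.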

[1] Ching Y. Suen, et al. Historical review of OCR research and development, 1992, Proc. IEEE.

[2] Bidyut Baran Chaudhuri, et al. Script line separation from Indian multi-script documents, 1999, Proceedings of the Fifth International Conference on Document Analysis and Recognition. ICDAR '99 (Cat. No.PR00318).

[3] Philip D. Wasserman, et al. Advanced methods in neural computing, 1993, VNR computer library.

[4] V. K. Govindan, et al. Character recognition - A review, 1990, Pattern Recognit.

[5] George Nagy, et al. Twenty Years of Document Image Analysis in PAMI, 2000, IEEE Trans. Pattern Anal. Mach. Intell.

[6] J. Mantas, et al. An overview of character recognition methodologies, 1986, Pattern Recognit.

[7] Patrick Kelly, et al. Automatic Script Identification From Document Images Using Cluster-Based Templates, 1997, IEEE Trans. Pattern Anal. Mach. Intell.

[8] Tieniu Tan. Written language recognition based on texture analysis, 1996, Proceedings of 3rd IEEE International Conference on Image Processing.

[9] A. Lawrence Spitz, et al. Determination of the Script and Language Content of Document Images, 1997, IEEE Trans. Pattern Anal. Mach. Intell.

[10] Tieniu Tan, et al. Rotation Invariant Texture Features and Their Use in Automatic Script Identification, 1998, IEEE Trans. Pattern Anal. Mach. Intell.

[11] Bidyut Baran Chaudhuri, et al. Automatic separation of words in multi-lingual multi-script Indian documents, 1997, Proceedings of the Fourth International Conference on Document Analysis and Recognition.

[12] Y.K. Muthusamy, et al. Reviewing automatic language identification, 1994, IEEE Signal Processing Magazine.

[13] P. Nagabhushan, et al. A connectionist expert system model for conflict resolution in unconstrained handwritten numeral recognition, 1998, Pattern Recognit. Lett.

[14] Anil K. Jain, et al. Page segmentation using texture analysis, 1996, Pattern Recognit.

[15] Santanu Chaudhury, et al. Trainable script identification strategies for Indian languages, 1999, Proceedings of the Fifth International Conference on Document Analysis and Recognition. ICDAR '99 (Cat. No.PR00318).