Identification of different script lines from multi-script documents

Abstract: To reach a wider readership, some documents are printed in several scripts and languages. Optical character recognition (OCR) of such a page requires a software module that identifies the script of each text line before the line is passed to the OCR system for that script. This paper presents an automatic technique for identifying printed Roman, Chinese, Arabic, Devnagari and Bangla text lines in a single document. The technique uses script characteristics, shape-based features, statistical features and features derived from the concept of water overflow from a reservoir. The scheme achieves an accuracy of about 97.33%.
