Identification of Latin-Based Languages through Character Stroke Categorization

This paper presents a language identification technique that detects the Latin-based language of an imaged document without OCR. The technique relies on word shape coding, which converts each word image into a word shape code and thereby transforms each document image into an electronic document vector. For each Latin-based language under study, a language template is first constructed through a corpus-based learning process. The underlying language of a query document is then determined from the similarity between the query document vector and each of the constructed language templates. Compared with previously reported methods, the proposed technique is fast, accurate, and tolerant of text segmentation errors caused by noise and various types of document degradation. Experimental results are promising.
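The pipeline described above (word shape codes, a document vector, per-language templates, and a similarity comparison) can be illustrated with a minimal sketch. Everything specific in it is an assumption made for illustration and is not taken from the paper: the three shape classes (ascender, descender, x-height) stand in for the paper's character stroke categories, plain text stands in for word images, the document vector is taken to be the normalized frequency of each word shape code, the template is the average of those vectors over a corpus, and cosine similarity is used as the similarity measure. The function names `shape_code`, `document_vector`, `build_template`, and `identify` are hypothetical.

```python
from collections import Counter
from math import sqrt

# Assumed shape classes; NOT the paper's actual stroke categories.
ASCENDERS = set("bdfhklt")
DESCENDERS = set("gjpqy")


def shape_code(word):
    """Map a word to a coarse shape code (A = tall, D = descending, x = other)."""
    code = []
    for ch in word:
        if ch.isupper() or ch in ASCENDERS:
            code.append("A")
        elif ch in DESCENDERS:
            code.append("D")
        elif ch.isalpha():
            code.append("x")
    return "".join(code)


def document_vector(text):
    """Normalized frequency of each word shape code in the text."""
    codes = [shape_code(w) for w in text.split()]
    counts = Counter(c for c in codes if c)
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()} if total else {}


def build_template(corpus):
    """Average the document vectors of a training corpus (assumed learning step)."""
    template = Counter()
    for doc in corpus:
        for code, freq in document_vector(doc).items():
            template[code] += freq / len(corpus)
    return dict(template)


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(f * v.get(c, 0.0) for c, f in u.items())
    norm_u = sqrt(sum(f * f for f in u.values()))
    norm_v = sqrt(sum(f * f for f in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def identify(text, templates):
    """Return the language whose template is most similar to the query vector."""
    query = document_vector(text)
    return max(templates, key=lambda lang: cosine(query, templates[lang]))


# Toy usage with one-sentence "corpora" (illustrative data only).
templates = {
    "en": build_template(["the cat and the dog"]),
    "fr": build_template(["le chat et le chien"]),
}
print(identify("the dog and the cat", templates))
```

Because the codes depend only on coarse character shape, a scheme of this kind does not need individual characters to be recognized, which is the property that lets the approach work without OCR.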
