Optimizing n‑gram Order of an n‑gram Based Language Identification Algorithm for 68 Written Languages

Language identification technology is widely used in machine learning and text mining. Many researchers have achieved excellent results on a few selected European languages, but the majority of African and Asian languages remain untested. The primary objective of this research is to evaluate the performance of our new n‑gram based language identification algorithm on 68 written languages used in the European, African and Asian regions. The secondary objective is to evaluate how the n‑gram order and a mixed n‑gram model affect the relative performance and accuracy of language identification. The n‑gram based algorithm used in this paper does not depend on n‑gram frequency. Instead, it uses a Boolean method to determine whether each target n‑gram matches a training n‑gram. The algorithm is designed to automatically detect the language, script and character encoding scheme of a written text. All three properties must be identified because a language can be written in different scripts and encoded with different character encoding schemes. In one test, the algorithm achieved a correct identification rate of up to 99.59% on selected languages. The results also show that language identification performance can be improved by using a mixed n‑gram model of bigrams and trigrams. The mixed n‑gram model also consumed less disk space and computing time than a trigram model.

DOI: 10.4038/icter.v2i2.1385. The International Journal on Advances in ICT for Emerging Regions 2009 02 (02): 21-28
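
The Boolean matching idea in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not give the scoring rule, so the match rate below (the fraction of distinct target n‑grams found in a training set), the function names `char_ngrams`, `match_rate` and `identify`, and the toy training sentences are all assumptions made for illustration. The sketch labels a model by language only, whereas the paper's algorithm also distinguishes script and character encoding.

```python
def char_ngrams(text, orders=(2, 3)):
    """Return the set of distinct character n-grams of the given orders.

    The default orders (2, 3) mimic a mixed bigram and trigram model.
    """
    return {text[i:i + n] for n in orders for i in range(len(text) - n + 1)}


def match_rate(target, trained):
    """Fraction of target n-grams present in the training set.

    Each n-gram counts as a Boolean hit or miss; frequencies are ignored.
    This scoring rule is an assumption, not taken from the paper.
    """
    if not target:
        return 0.0
    return sum(1 for gram in target if gram in trained) / len(target)


def identify(text, models, orders=(2, 3)):
    """Return the label of the training model with the highest match rate."""
    target = char_ngrams(text, orders)
    return max(models, key=lambda label: match_rate(target, models[label]))


# Hypothetical toy training data, far smaller than any real corpus.
training = {
    "english": "the quick brown fox jumps over the lazy dog and then the dog sleeps",
    "german": "der schnelle braune fuchs springt über den faulen hund und schläft dann",
}
models = {label: char_ngrams(sample) for label, sample in training.items()}

print(identify("the dog jumps", models))
```

Because each model is a plain set rather than a frequency table, storage and lookup cost depend only on the number of distinct n‑grams, which gives one plausible reading of why the n‑gram order affects disk space and computing time.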
