Word-length algorithm for language identification of under-resourced languages

Language identification is widely used in machine learning, text mining, information retrieval, and speech processing. Available techniques for language identification require large amounts of training text, which are not available for under-resourced languages, even though these form the bulk of the world's languages. The primary objective of this study is to propose a lexicon-based algorithm that can perform language identification using minimal training data. Because language identification is often the first step in natural language processing tasks, it is necessary to explore techniques that perform it in the shortest possible time. Hence, the second objective of this research is to study the effect of the proposed algorithm on the run-time performance of language identification. Precision, recall, and F1 measures were used to determine the effectiveness of the proposed word-length algorithm, using datasets drawn from the Universal Declaration of Human Rights in 15 languages. The experimental results show good accuracy for language identification at both the document level and the sentence level on the available dataset. The improved algorithm also showed a significant improvement in run-time performance compared with the spelling-checker approach.
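
As an illustration of the evaluation measures named above, the following minimal sketch computes per-language precision, recall, and F1 from gold and predicted language labels. It is not the study's evaluation code: the function name `per_language_scores`, the language codes, and the toy labels are assumptions made for this example only.

```python
from collections import Counter


def per_language_scores(gold, predicted):
    """Return {language: (precision, recall, f1)} for parallel label lists.

    Hypothetical helper for illustration; not taken from the paper.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1          # correctly identified as language g
        else:
            fp[p] += 1          # wrongly labelled as language p
            fn[g] += 1          # a text in language g that was missed

    scores = {}
    for lang in set(gold) | set(predicted):
        labelled = tp[lang] + fp[lang]   # texts the system labelled as lang
        actual = tp[lang] + fn[lang]     # texts that really are lang
        precision = tp[lang] / labelled if labelled else 0.0
        recall = tp[lang] / actual if actual else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        scores[lang] = (precision, recall, f1)
    return scores


# Toy labels (assumed): four sentences, one misidentified.
gold = ["zul", "zul", "xho", "eng"]
predicted = ["zul", "xho", "xho", "eng"]
scores = per_language_scores(gold, predicted)
```

Scores are reported per language here; averaging them across languages (macro-averaging) is one common way to summarise a multi-language evaluation, though the abstract does not say which averaging the study used.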
