Improving Language Identification of Web Page Using Optimum Profile

Language is an indispensable tool for human communication, and English is presently the dominant language on the Internet. Language identification is the process of automatically determining which of a predetermined set of languages (e.g., English, Malay, Danish, Estonian, Czech, or Slovak) a given piece of content is written in. The ability to identify languages other than English is highly desirable, and the goal of this research is to improve the methods used to do so. Three methods are studied: distance measurement, the Boolean method, and the proposed method, namely the optimum profile. Initial experiments showed that distance measurement and the Boolean method are not reliable for identifying the language of European web pages. We therefore propose the optimum profile, which uses both N-gram frequency and N-gram position to identify the language of a web page. The results show that the proposed method gives the highest performance, with an accuracy of 91.52%.
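
To make the two baseline ideas concrete, the following is a minimal Python sketch of N-gram profiling with a rank-based distance and a Boolean overlap score. It is an illustration under stated assumptions, not the implementation evaluated in this work: the rank-based "out-of-place" distance is one common way to realise distance measurement over N-gram profiles, the Boolean score is assumed here to be the fraction of document N-grams present in a language profile, and all function names and the parameters `n_max` and `top_k` are hypothetical. The optimum profile itself is not sketched, since the abstract does not specify how frequency and position are combined.

```python
from collections import Counter


def ngram_profile(text, n_max=3, top_k=300):
    """Return the top_k character N-grams (lengths 1..n_max), most frequent first."""
    # Normalise whitespace and pad so word boundaries appear in the N-grams.
    padded = " " + " ".join(text.lower().split()) + " "
    counts = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(padded) - n + 1):
            counts[padded[i:i + n]] += 1
    # Sort by descending frequency, breaking ties alphabetically for determinism.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top_k]]


def out_of_place_distance(doc_profile, lang_profile):
    """Sum of rank differences; N-grams missing from lang_profile get a fixed penalty."""
    position = {gram: rank for rank, gram in enumerate(lang_profile)}
    max_penalty = len(lang_profile)
    return sum(
        abs(rank - position[gram]) if gram in position else max_penalty
        for rank, gram in enumerate(doc_profile)
    )


def boolean_match_rate(doc_profile, lang_profile):
    """Assumed Boolean score: share of document N-grams found in the language profile."""
    if not doc_profile:
        return 0.0
    lang_set = set(lang_profile)
    return sum(gram in lang_set for gram in doc_profile) / len(doc_profile)


def identify(text, lang_profiles):
    """Pick the language whose profile has the smallest out-of-place distance."""
    doc_profile = ngram_profile(text)
    return min(
        lang_profiles,
        key=lambda lang: out_of_place_distance(doc_profile, lang_profiles[lang]),
    )
```

In this sketch, `lang_profiles` would be a dictionary mapping each language name to a profile built with `ngram_profile` from training text in that language. The distance measure uses the position (rank) of each N-gram, while the Boolean score uses only its presence.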
