Arabic Script Web Page Language Identification Using Hybrid-KNN Method

In this paper, we propose hybrid-KNN methods for Arabic script web page language identification. A crucial task in text-based identification of languages that share the same script is producing reliable features while coping with the large number of languages in the world. Specifically, this involves feature representation, feature selection, identification performance, retrieval performance, and noise tolerance. We therefore evaluate several methods in this work: k-nearest neighbor (KNN), support vector machine (SVM), backpropagation neural networks (BPNN), and the hybrids KNN-SVM and KNN-BPNN, in order to assess the capability of state-of-the-art methods. KNN is prominent in data clustering and data filtering, SVM and BPNN are well known in supervised classification, and we propose hybrid-KNN for noise removal in web page language identification. We use the standard measures of accuracy, precision, recall, and F1 to evaluate the effectiveness of the proposed hybrid-KNN. In the experiments, we observe that BPNN produces precise identification when the given data set is clean. However, as the level of noise in the training data increases, KNN-SVM performs better than KNN-BPNN against misclassified data, even at a noise level of 50%. The results therefore indicate that KNN-SVM produces promising identification performance, with KNN reducing the noise in the data set and SVM providing reliable language identification.
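
The noise-removal role that the abstract assigns to KNN can be illustrated with a small sketch. The abstract does not specify how the filtering is done, so the code below assumes one common scheme: a training sample is dropped when its label disagrees with the majority label of its k nearest neighbours. The function name `knn_filter`, the two-dimensional feature vectors, and the language labels are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
import math
from collections import Counter


def knn_filter(samples, labels, k=3):
    """Return indices of samples whose label agrees with the majority
    label of their k nearest neighbours (Euclidean distance).

    Assumption: this edited-nearest-neighbour style rule stands in for
    the KNN filtering step; the paper's exact rule is not given here.
    """
    kept = []
    for i, (x, y) in enumerate(zip(samples, labels)):
        # Distance from sample i to every other sample, with that sample's label.
        neighbours = sorted(
            (math.dist(x, other), labels[j])
            for j, other in enumerate(samples)
            if j != i
        )[:k]
        majority = Counter(lbl for _, lbl in neighbours).most_common(1)[0][0]
        if majority == y:
            kept.append(i)
    return kept


# Toy features (hypothetical, e.g. two letter-frequency values per page).
samples = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),   # cluster labelled "ar"
    (1.0, 1.0), (0.9, 1.0), (1.0, 0.9), (0.9, 0.9),   # cluster labelled "fa"
    (0.05, 0.05),                                      # mislabelled sample
]
labels = ["ar"] * 4 + ["fa"] * 4 + ["fa"]

kept = knn_filter(samples, labels, k=3)
clean_samples = [samples[i] for i in kept]
clean_labels = [labels[i] for i in kept]
```

In a hybrid setup of the kind described, `clean_samples` and `clean_labels` would then be passed to the supervised classifier (SVM or BPNN) for training, so that the classifier never sees the samples flagged as noisy.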
