TR-Classifier and kNN Evaluation for Topic Identification tasks

This paper studies topic identification for the Arabic language using two methods. The first is the well-known kNN (k Nearest Neighbors), used as a baseline. The second is the TR-Classifier, which is based mainly on computing triggers. The experiments show that the TR-Classifier outperforms kNN while using much smaller topic vocabularies. TR-Classifier performance improves when the number of triggers and the size of the topic vocabularies are increased jointly. The TR-Classifier uses topic vocabularies, whereas kNN needs a general vocabulary, which is obtained by concatenating the topic vocabularies used by the TR-Classifier. In addition to the standard Recall and Precision measures used for evaluation, we draw ROC curves for some topics to illustrate more clearly the difference in performance between the two classifiers. The corpus used in our experiments was downloaded from an online Arabic newspaper. It contains about 10 million words distributed over six selected topics: culture, religion, economy, local news, international news and sports.
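As a rough illustration of two steps mentioned above, the following Python sketch builds a general vocabulary from per-topic vocabularies and computes per-topic Precision and Recall. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names `general_vocabulary` and `precision_recall`, the toy words and the toy labels are all hypothetical, and dropping duplicate words during concatenation is an assumption the abstract does not confirm.

```python
def general_vocabulary(topic_vocabularies):
    """Concatenate per-topic vocabularies into one general vocabulary.

    Assumption: a word shared by several topics is kept only once,
    in first-seen order. The abstract only says "concatenation".
    """
    seen = set()
    merged = []
    for words in topic_vocabularies.values():
        for word in words:
            if word not in seen:
                seen.add(word)
                merged.append(word)
    return merged


def precision_recall(gold, predicted, topic):
    """Return (precision, recall) for one topic from parallel label lists."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == topic and p == topic)  # true positives
    fp = sum(1 for g, p in pairs if g != topic and p == topic)  # false positives
    fn = sum(1 for g, p in pairs if g == topic and p != topic)  # false negatives
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical toy data, for illustration only.
toy_vocabularies = {
    "sports": ["goal", "match"],
    "economy": ["market", "goal"],
}
toy_gold = ["sports", "sports", "economy", "culture"]
toy_predicted = ["sports", "economy", "economy", "culture"]

vocabulary = general_vocabulary(toy_vocabularies)
sports_scores = precision_recall(toy_gold, toy_predicted, "sports")
economy_scores = precision_recall(toy_gold, toy_predicted, "economy")
```

On this toy data, one of the two sports documents is mislabelled as economy, so sports Recall drops while its Precision stays perfect, and economy shows the opposite pattern.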
