Evaluation of Topic Identification Methods on Arabic Corpora

Topic identification is a key component of many applications. Few studies address it for Arabic, however, largely because standard corpora are lacking. In this study, we provide directly comparable results for six text categorization methods on a new Arabic corpus, Alwatan-2004. We evaluated the Topic Unigram Language Model (TULM), Term Frequency/Inverse Document Frequency (TFIDF), a neural network, SVM, M-SVM, and TR. The experiments show that the TR-Classifier is among the most efficient of these classifiers; only the binary SVM outperformed it, thanks to its characteristics. The Alwatan-2004 corpus used in our experiments is also considered larger than any other Arabic corpus used for topic identification experiments so far. In addition, we use small vocabulary sizes to reduce computation time. This matters for adaptive language modeling, particularly topic adaptation, which is required in real-time applications such as speech recognition and machine translation systems. Our experiments indicate that the results are better than those reported in other work on Arabic text categorization.
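As an illustration of one of the methods named above, the following is a minimal sketch of TFIDF-based topic identification: each topic is represented by a TFIDF vector built from its training text, and a test document is assigned to the topic with the highest cosine similarity. The function names, the smoothed IDF formula, and the toy English tokens are assumptions made for illustration; they are not taken from the study, which worked on the Alwatan-2004 corpus.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: weight} dict per tokenized document, plus the IDF table."""
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    # Assumed smoothing (+1) so terms shared by all topics keep a nonzero weight.
    idf = {t: math.log(n / df[t]) + 1.0 for t in df}
    vectors = [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]
    return vectors, idf

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def identify_topic(tokens, topic_vectors, idf):
    """Pick the topic whose TFIDF vector is closest to the document."""
    # Out-of-vocabulary tokens are ignored.
    query = {t: tf * idf[t] for t, tf in Counter(tokens).items() if t in idf}
    return max(topic_vectors, key=lambda topic: cosine(query, topic_vectors[topic]))

# Hypothetical toy training data: one bag of tokens per topic.
training = {
    "sport": "match team goal team player".split(),
    "economy": "market bank price market trade".split(),
}
names = list(training)
vectors, idf = tfidf_vectors([training[name] for name in names])
topic_vectors = dict(zip(names, vectors))

print(identify_topic("the team scored a goal".split(), topic_vectors, idf))  # -> sport
```

Restricting the vocabulary to a small set of terms, as the abstract describes, would correspond here to keeping fewer keys in `idf`, which shortens every vector and reduces the cost of each similarity computation.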
