Neural Text Categorizer for Topic Identification of Noisy Arabic Texts

This paper deals with the topic identification problem, which consists of recognizing the subject of a text. Although several statistical and machine learning approaches address this problem, most of them assume relatively clean and long texts, and they fail on corrupted or short texts. Moreover, few studies have been conducted on Arabic, which is a rich and complex language. For that reason, we investigate topic identification of noisy Arabic texts. To address this problem, we present the design and implementation of the Neural Text Categorizer (NTC), a novel neural network that differs from existing NNs in some concepts. Furthermore, we present and discuss a proposed improvement of the NTC (called NTCT), which is based on TF-IDF weights and modifies both the input vector and the classification formula. The two algorithms were evaluated empirically on an in-house corpus (called ANTSIX) containing discussion forum texts. We also compared our best results with the state of the art. We found that the proposed NTCT maintained consistently high performance and outperformed several algorithms in topic identification of noisy Arabic texts.
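The abstract states that NTCT builds its input vector from TF-IDF weights but does not give the formulas. As a rough illustration only, the sketch below computes one common TF-IDF variant (relative term frequency times the natural log of inverse document frequency). The function name `tfidf_vectors`, the toy English tokens, and the choice of variant are all assumptions made for this example; it does not reproduce the NTCT input vector or its classification formula, and it assumes the texts have already been tokenized.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per tokenized document.

    Hypothetical helper: uses tf = count / document length and
    idf = ln(N / df), one common TF-IDF variant among several.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        vectors.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


# Toy stand-ins for short forum posts (placeholder tokens, not ANTSIX data).
docs = [
    ["match", "goal", "team", "goal"],
    ["election", "vote", "team"],
    ["goal", "vote", "market"],
]
vectors = tfidf_vectors(docs)
for vec in vectors:
    print(sorted(vec.items()))
```

With this variant, a term that occurs in every document receives weight zero, so words shared by all topics contribute nothing to the vector, while terms concentrated in few documents are weighted more heavily.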
