An arabic lemma-based stemmer for latent topic modeling

Developments in Arabic information retrieval did not follow the increasing use of the Arabic Web during the last decade. Semantic indexing in a language with high inflectional morphology, such as Arabic, is not a trivial task and requires a text analysis in the original language. Excepting cross-language retrieval methods or limited studies, the main efforts, for developing semantic analysis methods and topic modeling, did not include Arabic text. This paper describes our approach for analyzing semantics in Arabic texts. A new lemma-based stemmer is developed and compared to root-based one for characterizing Arabic text. The Latent Dirichlet Allocation (LDA) model is adapted to extract Arabic latent topics from various real-world corpora. In addition to the interesting subjects discovered in the press articles during the 2007-2009 period, experiments show that the classification performances with lemma-based stemming in the topics space, are improved when comparing to classification with root-based stemming.

[1]  ChengXiang Zhai,et al.  Cross-Lingual Latent Topic Extraction , 2010, ACL.

[2]  Ramesh Nallapati,et al.  Link-PLSA-LDA: A New Unsupervised Model for Topics and Influence of Blogs , 2021, ICWSM.

[3]  Thorsten Brants,et al.  Arabic DocumentTopic Analysis , 2002 .

[4]  Padhraic Smyth,et al.  Combining concept hierarchies and statistical topic models , 2008, CIKM '08.

[5]  Thomas Hofmann,et al.  Probabilistic Latent Semantic Analysis , 1999, UAI.

[6]  Max Welling,et al.  Fast collapsed gibbs sampling for latent dirichlet allocation , 2008, KDD.

[7]  Lisa Ballesteros,et al.  Improving stemming for Arabic information retrieval: light stemming and co-occurrence analysis , 2002, SIGIR '02.

[8]  Vladimir N. Vapnik,et al.  The Nature of Statistical Learning Theory , 2000, Statistics for Engineering and Information Science.

[9]  Christopher J. C. Burges,et al.  A Tutorial on Support Vector Machines for Pattern Recognition , 1998, Data Mining and Knowledge Discovery.

[10]  Laurence Tuerlinckx,et al.  La lemmatisation de l'arabe non classique , 2004 .

[11]  Hal Daumé,et al.  Extracting Multilingual Topics from Unaligned Comparable Corpora , 2010, ECIR.

[12]  Fredric C. Gey,et al.  The TREC 2002 Arabic/English CLIR Track , 2002, TREC.

[13]  Amine Bensaid,et al.  Automatic Arabic Document Categorization Based on the Naïve Bayes Algorithm , 2004 .

[14]  Haidar Moukdad,et al.  Stemming and root-based approaches to the retrieval of Arabic documents on the Web , 2006, Webology.

[15]  Michael I. Jordan,et al.  Latent Dirichlet Allocation , 2001, J. Mach. Learn. Res..

[16]  Naonori Ueda,et al.  Probabilistic latent semantic visualization: topic model for visualizing documents , 2008, KDD.

[17]  Ahmad T. Al-Taani,et al.  A rule-based approach for tagging non-vocalized Arabic words , 2009, Int. Arab J. Inf. Technol..

[18]  Victor Lavrenko,et al.  Language-specific models in multilingual topic tracking , 2004, SIGIR '04.

[19]  Padhraic Smyth,et al.  Subject metadata enrichment using statistical topic models , 2007, JCDL '07.

[20]  Thomas L. Griffiths,et al.  Probabilistic Topic Models , 2007 .

[21]  Alaa M. El-Halees,et al.  Arabic Text Classification Using Maximum Entropy , 2015 .

[22]  Jian-Yun Nie,et al.  Effective Stemming for Arabic Information Retrieval , 2006, BCS.