Enhancing Root Extractors Using Light Stemmers

The rise of Natural Language Processing (NLP) has opened new possibilities for applications that were not feasible before. A morphologically rich language such as Arabic offers features, such as roots, that can assist the progress of NLP. Many tools have been developed to perform root extraction (stemming). Stemmers have improved many NLP tasks, even though their stemming accuracy was not explicitly known. In this paper, a study is conducted to evaluate various Arabic stemmers. The study is carried out as a series of comparisons on a manually annotated dataset; it shows the efficiency of Arabic stemmers and points out potential improvements to existing ones. The paper also presents enhanced root extractors that use light stemmers as a preprocessing phase.
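To make the idea of light stemming as a preprocessing phase concrete, the following is a minimal sketch, not the paper's implementation. The affix lists, the function names (`light_stem`, `toy_root_extractor`, `with_light_stemming`, `accuracy`), and the two annotated word/root pairs are all assumptions introduced for illustration. The toy root extractor is a deliberately naive stand-in for a real one and only drops long vowels.

```python
from typing import Callable, Iterable, Tuple

# Illustrative affix lists (an assumption; not taken from any stemmer in the paper).
PREFIXES = ("وال", "بال", "كال", "فال", "ال", "لل", "و")
SUFFIXES = ("ها", "ان", "ات", "ون", "ين", "ية", "ه", "ة", "ي")
LONG_VOWELS = set("اوي")


def light_stem(word: str, min_len: int = 3) -> str:
    """Strip at most one prefix and one suffix, keeping at least min_len letters."""
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= min_len:
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_len:
            word = word[:-len(suffix)]
            break
    return word


def toy_root_extractor(word: str) -> str:
    """Hypothetical stand-in for a real root extractor:
    drop long vowels left to right while more than three letters remain."""
    letters = list(word)
    i = 0
    while len(letters) > 3 and i < len(letters):
        if letters[i] in LONG_VOWELS:
            del letters[i]
        else:
            i += 1
    return "".join(letters)


def with_light_stemming(extract: Callable[[str], str]) -> Callable[[str], str]:
    """Wrap a root extractor so that it receives light-stemmed input."""
    return lambda word: extract(light_stem(word))


def accuracy(extract: Callable[[str], str],
             gold: Iterable[Tuple[str, str]]) -> float:
    """Fraction of words whose extracted root matches the annotated root."""
    pairs = list(gold)
    return sum(extract(word) == root for word, root in pairs) / len(pairs)


# Two made-up (word, root) pairs for illustration only; not the paper's dataset.
GOLD = [("الكاتبون", "كتب"), ("والدارسين", "درس")]

baseline = accuracy(toy_root_extractor, GOLD)
enhanced = accuracy(with_light_stemming(toy_root_extractor), GOLD)
print(f"baseline={baseline:.2f} enhanced={enhanced:.2f}")
```

On these two toy words the wrapped extractor recovers the annotated roots while the bare one does not, because the affixes are removed before root extraction. This only illustrates the mechanism and says nothing about the accuracy of the stemmers evaluated in the paper.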
