Automated Arabic text classification with P-Stemmer, machine learning, and a tailored news article taxonomy

Arabic news articles in electronic collections are difficult to study because browsing by category is rarely supported. Although machine-learning methods have been applied successfully to this problem for English news articles, little research has produced suitable solutions for Arabic news. In connection with a Qatar National Research Fund (QNRF)-funded project to build a digital library community and infrastructure in Qatar, we developed software for browsing a collection of about 237,000 Arabic news articles; the software should also be applicable to other Arabic news collections. We designed a simple taxonomy for Arabic news stories that suits the needs of Qatar and other nations, is compatible with the subject codes of the International Press Telecommunications Council, and was refined with the aid of an expert librarian and five Arabic-speaking volunteers. To work with the taxonomy, we developed a tailored stemming method (a new Arabic light stemmer called P-Stemmer) and automatic classification methods, of which binary support vector machine (SVM) classifiers performed best. Using evaluation techniques common in the information retrieval community, including 10-fold cross-validation and the Wilcoxon signed-rank test, we showed that our approach to stemming and classification is superior to state-of-the-art techniques.
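To illustrate the kind of paired comparison that a Wilcoxon signed-rank test makes over cross-validation folds, the following is a minimal sketch and not the authors' evaluation code. The function name `wilcoxon_signed_rank` and the per-fold scores are hypothetical, invented purely for illustration; they are not results from this work. The sketch computes only the rank sums W+ and W-; a p-value would come from a critical-value table or a statistics library.

```python
def wilcoxon_signed_rank(a, b):
    """Return (w_plus, w_minus, n) for paired samples a and b.

    Zero differences are dropped and tied absolute differences
    share the average of the ranks they span.
    """
    if len(a) != len(b):
        raise ValueError("paired samples must have equal length")

    # Signed differences, ignoring pairs with no difference.
    diffs = [x - y for x, y in zip(a, b) if x != y]

    # Indices of the differences, ordered by absolute value.
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))

    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        # Extend j over the run of tied absolute differences.
        j = i
        while (j + 1 < len(order)
               and abs(diffs[order[j + 1]]) == abs(diffs[order[i]])):
            j += 1
        # Average of the 1-based ranks i+1 .. j+1.
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1

    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return w_plus, w_minus, len(diffs)


if __name__ == "__main__":
    # HYPOTHETICAL per-fold scores (integer percentage points) for two
    # classifiers over the same 10 folds; made up for illustration only.
    scores_a = [91, 89, 92, 90, 93, 88, 91, 92, 90, 94]
    scores_b = [88, 89, 90, 91, 89, 85, 90, 88, 87, 90]
    w_plus, w_minus, n = wilcoxon_signed_rank(scores_a, scores_b)
    print(f"W+ = {w_plus}, W- = {w_minus}, non-zero pairs = {n}")
```

The smaller of the two rank sums is the usual test statistic: the more lopsided the sums, the stronger the evidence that one method is consistently better across folds. Integer scores are used here so that ties are detected exactly; with floating-point scores, rounding the differences before ranking is one way to avoid spurious near-ties.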
