Arabic Text Categorization Using Logistic Regression

Several Text Categorization (TC) techniques and algorithms have been investigated in the limited research literature on Arabic TC. This research investigates Logistic Regression (LR) for Arabic TC. To the best of our knowledge, LR has not previously been used for Arabic TC. Experiments are conducted on the Aljazeera Arabic News (Alj-News) dataset, which is first preprocessed to handle the special nature of Arabic text. The experimental results show that the LR classifier is competitive with state-of-the-art Arabic TC algorithms: it recorded a precision of 96.5% on one category and above 90% on three of the five categories of the Alj-News dataset. In terms of overall performance, LR recorded a macroaverage precision of 87%, a recall of 86.33% and an F-measure of 86.5%.
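For readers unfamiliar with macroaveraging, the following is a minimal sketch of how macroaverage precision, recall and F-measure can be computed from per-category counts. It assumes the F-measure is F1 and that the macroaverage is the unweighted mean of the per-category scores; the abstract does not state which convention was used. The function name `macro_average` and the counts in the usage note are hypothetical and are not taken from the Alj-News experiments.

```python
from typing import Dict, Tuple


def macro_average(counts: Dict[str, Tuple[int, int, int]]) -> Tuple[float, float, float]:
    """Return (macro precision, macro recall, macro F1).

    `counts` maps each category name to (true positives, false positives,
    false negatives). Each category contributes equally to the average,
    regardless of how many documents it contains.
    """
    precisions, recalls, f1s = [], [], []
    for tp, fp, fn in counts.values():
        # Per-category scores; a zero denominator is treated as a score of 0.
        p = tp / (tp + fp) if (tp + fp) else 0.0
        r = tp / (tp + fn) if (tp + fn) else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f)
    n = len(counts)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n
```

For example, with made-up counts for two categories, `macro_average({"a": (8, 2, 2), "b": (5, 5, 0)})` averages per-category precisions of 0.8 and 0.5 and recalls of 0.8 and 1.0. Because the F-measure is averaged per category under this assumption, it need not equal the harmonic mean of the macroaverage precision and recall.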
