Compression-based Arabic text classification

Text classification (TC) is one of the fundamental problems in text mining. Many existing works on TC report interesting approaches and excellent results; however, most of them follow a word-based approach to feature extraction. In this work, we are interested in an alternative byte-based or character-based approach known as compression-based TC (CTC). CTC has been applied to languages such as English and Portuguese, where it has been shown to have certain advantages and disadvantages compared with word-based approaches. This work applies CTC to Arabic to investigate whether these advantages and disadvantages hold for Arabic as well. The results are encouraging: they show the viability of using CTC for Arabic TC.
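To make the byte-based idea concrete, the following is a minimal sketch of one common form of CTC: a document is assigned to the class whose training text lets a compressor encode it in the fewest additional bytes. The abstract does not state which compressor or decision rule the authors use, so everything here is an assumption for illustration: zlib (DEFLATE) stands in as the compressor, and the names `compressed_size`, `classify`, and `TRAINING`, as well as the toy two-class corpus, are hypothetical.

```python
import zlib


def compressed_size(text: str) -> int:
    """Size in bytes of the zlib-compressed UTF-8 encoding of text."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def classify(document: str, training: dict) -> str:
    """Return the label whose training text best 'explains' the document.

    For each class, measure how many extra compressed bytes are needed
    when the document is appended to that class's training text. The
    class with the smallest increase wins. No tokenization or stemming
    is involved: the compressor works directly on bytes.
    """
    def extra_bytes(label: str) -> int:
        corpus = training[label]
        return compressed_size(corpus + " " + document) - compressed_size(corpus)

    return min(training, key=extra_bytes)


# Toy training data (assumed for illustration, not from the paper).
TRAINING = {
    "sports": "فاز الفريق في مباراة كرة القدم وسجل اللاعب هدفين",
    "economy": "ارتفعت أسعار الأسهم في البورصة وخفض البنك سعر الفائدة",
}

label = classify("سجل اللاعب هدفا في مباراة كرة القدم", TRAINING)
print(label)
```

The test document shares long byte sequences with the sports text, so the compressor can encode most of it as back-references to that text; against the economy text it has to encode almost every byte afresh. In practice the training text per class would be far larger, and a general-purpose compressor with a small window such as zlib may be a poor fit for large corpora.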
