Automatic Kurdish Text Classification Using KDC 4007 Dataset

The volume of text documents uploaded to the Internet grows daily, and the number of Kurdish documents available on the web increases drastically with each passing day. On news sites specifically, documents belonging to categories such as health, politics, and sport may appear in the wrong category, or may be placed in a nonspecific category called "others". This paper is concerned with text classification of Kurdish text documents, that is, placing an article or an email into its correct class according to its contents. Although a considerable number of studies address text classification in other languages, the number of studies conducted on Kurdish is extremely limited because of the lack of openly available and convenient datasets. In this paper, a new dataset named KDC-4007 is created that can be widely used in studies of text classification of Kurdish news and articles. The file formats of the KDC-4007 dataset are compatible with well-known text mining tools. Three of the best-known text classification algorithms, Support Vector Machine (SVM), Naive Bayes (NB), and Decision Tree (DT) classifiers, are compared on KDC-4007 together with the TF × IDF feature weighting method. The paper also studies the effect of applying a Kurdish stemmer on the effectiveness of these classifiers. The experimental results indicate that the SVM classifier provides a good accuracy of 91.03%, especially when stemming and TF × IDF feature weighting are included in the preprocessing phase. The KDC-4007 dataset is publicly available, and the outcome of this study can serve as a baseline for future evaluations of other classifiers by other researchers.
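
To make the TF × IDF feature weighting step concrete, the following is a minimal sketch in Python. The abstract does not state which TF × IDF variant is used, so the formula here (raw term count multiplied by the natural log of N over document frequency) is an assumption, and the function name `tf_idf` and the toy corpus are hypothetical illustrations, not taken from KDC-4007.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed variant: weight = raw term count * ln(N / document frequency).
    """
    n_docs = len(docs)
    # Document frequency: the number of documents each term appears in.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: count * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical toy corpus (already tokenized and stemmed), not from KDC-4007.
docs = [
    ["health", "doctor", "health"],
    ["sport", "match", "team"],
    ["sport", "health"],
]
weights = tf_idf(docs)
```

Under this variant, a term that occurs in every document receives weight zero, while terms concentrated in few documents are weighted up. The resulting vectors are the kind of input that classifiers such as SVM, NB, or DT would then be trained on.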
