Text classification is the process of automatically assigning documents to class labels based on their content.
It is also considered an important element of task management and information organization. The quality of text
classification depends heavily on the quality of the preprocessing steps. Materials and Methods: In this study, a novel preprocessing
method (normalizing, stemming, removing stopwords and removing non-Kurdish text and symbols) was evaluated by
comparing the performance of two text classification techniques, namely the decision tree (C4.5) classifier and the Support Vector Machine (SVM)
classifier. The two learning algorithms for text categorization were compared on a set of Kurdish Sorani text documents
collected from different Kurdish websites. The documents fall into 8 main categories: sports, religion, arts, economics,
education, social, style and health. The preprocessing steps (normalizing some characters, stemming, removing stopwords and
removing non-Kurdish text and symbols) were applied to the text documents; next, the documents were converted into an appropriate
file format, and finally the classification was conducted. Results: The findings of this study showed that the highest accuracy (93.1%)
and the shortest time taken to build the classifier were achieved with the SVM classifier after the preprocessing and feature weighting steps
were performed. Conclusion: The experimental results of this study can be used in future work as a baseline for comparison with other classifiers
and Kurdish stemmers.
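
The four preprocessing steps named above (normalizing, removing non-Kurdish text and symbols, removing stopwords, stemming) can be illustrated with a minimal sketch. This is not the study's implementation: the normalization map, the stopword list, the suffix list and the function names `strip_suffix` and `preprocess` are all small hypothetical placeholders chosen for illustration, and the suffix stripper is a crude stand-in for a real Kurdish stemmer. Characters are written as Unicode escapes to avoid ambiguity between look-alike Arabic-script letters.

```python
import re
import unicodedata

# ASSUMPTION: illustrative normalization map, not the study's actual rules.
NORMALIZE_MAP = {
    "\u064A": "\u06CC",  # Arabic yeh -> Farsi yeh (form used in Sorani)
    "\u0643": "\u06A9",  # Arabic kaf -> keheh (form used in Sorani)
}

# ASSUMPTION: tiny illustrative stopword list (Sorani "and", "in/from", "for").
STOPWORDS = {"\u0648", "\u0644\u06D5", "\u0628\u06C6"}

# ASSUMPTION: a few common Sorani suffixes, longest first
# (definite plural, definite singular, plural).
SUFFIXES = [
    "\u06D5\u06A9\u0627\u0646",
    "\u06D5\u06A9\u06D5",
    "\u0627\u0646",
]

# Anything that is not an Arabic-script letter or whitespace is treated as
# non-Kurdish (Latin text, digits, punctuation, symbols).
NON_KURDISH = re.compile(r"[^\u0621-\u063A\u0641-\u064A\u067E-\u06D3\u06D5\s]+")


def strip_suffix(token):
    """Remove the first matching suffix, keeping at least two characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 2:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Normalize, drop non-Kurdish symbols, remove stopwords, then stem."""
    text = unicodedata.normalize("NFC", text)
    for src, dst in NORMALIZE_MAP.items():
        text = text.replace(src, dst)
    text = NON_KURDISH.sub(" ", text)
    tokens = [t for t in text.split() if t not in STOPWORDS]
    return [strip_suffix(t) for t in tokens]


if __name__ == "__main__":
    # A word written with Arabic kaf, a stopword, Latin text, digits, punctuation.
    doc = "\u0643\u062A\u06CE\u0628\u06D5\u06A9\u0627\u0646 \u0644\u06D5 Kurdistan 2020!"
    print(preprocess(doc))
```

In a full pipeline, the token lists returned by `preprocess` would then be feature-weighted and passed to a classifier such as a decision tree or an SVM; that stage is omitted here.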