A Robust Categorization System for Kurdish Sorani Text Documents

Text classification is the process of automatically assigning documents to class labels based on their content. It is an important element of information management and organization. The text classification process depends heavily on the quality of the preprocessing steps. Materials and Methods: In this study, a novel preprocessing method (normalizing, stemming, removing stopwords and removing non-Kurdish texts and symbols) was evaluated by comparing the performance of two text classification techniques, namely the decision tree (C4.5) classifier and the Support Vector Machine (SVM) classifier. The two learning algorithms were compared on a set of Kurdish Sorani text documents collected from different Kurdish websites. The documents fall into 8 main categories: sports, religion, arts, economics, education, social, style and health. The preprocessing steps (normalizing some characters, stemming, removing stopwords and removing non-Kurdish texts and symbols) were applied to the documents; next, the documents were converted into an appropriate file format, and finally the classification was conducted. Results: The highest accuracy (93.1%) and the shortest time taken to build the classifier were achieved with the SVM classifier after the preprocessing and feature weighting steps were performed. Conclusion: The experimental results of this study can be used in the future as a baseline for comparison with other classifiers and Kurdish stemmers.
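
The four preprocessing steps named above can be illustrated with a minimal Python sketch. It is not the method evaluated in the study: the character map, the stopword list, the suffix list, the order of the steps and all function names are hypothetical placeholders chosen for illustration, and a real Kurdish Sorani stemmer would need far more rules. Non-Kurdish material is approximated here as any character that is not a letter in the Unicode Arabic block.

```python
import unicodedata

# Assumption: a tiny illustrative map of Arabic letter variants to the forms
# commonly used in Sorani text. A real normalizer would cover more cases.
NORMALIZE_MAP = str.maketrans({
    "\u064A": "\u06CC",  # Arabic yeh  -> Farsi yeh
    "\u0643": "\u06A9",  # Arabic kaf  -> keheh
})

# Assumption: placeholder stopwords ("and", "in/from", "for").
STOPWORDS = {"\u0648", "\u0644\u06D5", "\u0628\u06C6"}

# Assumption: placeholder suffixes (definite plural, definite singular, plural).
SUFFIXES = ("\u06D5\u06A9\u0627\u0646", "\u06D5\u06A9\u06D5", "\u0627\u0646")


def normalize(text):
    """Map variant characters to a single canonical form."""
    return text.translate(NORMALIZE_MAP)


def strip_non_kurdish(text):
    """Replace everything except Arabic-block letters with a space."""
    kept = []
    for ch in text:
        is_letter = unicodedata.category(ch).startswith("L")
        in_arabic_block = "\u0600" <= ch <= "\u06FF"
        kept.append(ch if (is_letter and in_arabic_block) else " ")
    return "".join(kept)


def stem(token, suffixes=SUFFIXES, min_stem_len=2):
    """Strip the longest matching suffix, keeping at least min_stem_len letters."""
    for suffix in sorted(suffixes, key=len, reverse=True):
        if token.endswith(suffix) and len(token) - len(suffix) >= min_stem_len:
            return token[: -len(suffix)]
    return token


def preprocess(text, stopwords=STOPWORDS, suffixes=SUFFIXES):
    """Normalize, drop non-Kurdish symbols, remove stopwords, then stem."""
    cleaned = strip_non_kurdish(normalize(text))
    tokens = [t for t in cleaned.split() if t not in stopwords]
    return [stem(t, suffixes) for t in tokens]
```

The resulting token lists would then be converted into the input format of the chosen classifier and weighted before training. Replacing rejected characters with a space is a simplification: it splits a word wherever a non-letter mark occurs inside it.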