Automatic Kurdish Sorani text categorization using N-gram based model

N-gram based models for text categorization have been applied to many languages, particularly those of the Indo-European family. Regrettably, few studies have applied such models to the Kurdish Sorani language. This paper presents the results of investigating an N-gram frequency statistics technique for classifying Kurdish Sorani Unicode documents from online newspapers into their classes. The technique generates frequency profiles for the training and test documents, using word-level 1-grams and character-level 2-, 3-, 4-, 5-, 6-, 7-, and 8-grams as text representations. A similarity measure, the “Dice measure of similarity”, is then employed to classify the documents. Results show that character-level 5-grams give a better text representation than the other settings, which in turn leads to better text classification.
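The pipeline described above (build N-gram frequency profiles, then compare them with the Dice measure) can be illustrated with a minimal Python sketch. Several details here are assumptions rather than the authors' method: the abstract does not state the profile size or the exact Dice variant, so the sketch keeps the `top_k` most frequent character N-grams (a hypothetical cut-off of 300) and uses the set-based Dice coefficient 2|A∩B| / (|A| + |B|). The function names and the toy English training strings are also illustrative only.

```python
from collections import Counter


def char_ngrams(text, n):
    """Return all overlapping character n-grams of `text`."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def profile(text, n, top_k=300):
    """Frequency profile: the set of the `top_k` most frequent n-grams.

    `top_k` is an assumed cut-off, not a value taken from the paper.
    """
    counts = Counter(char_ngrams(text, n))
    return {gram for gram, _ in counts.most_common(top_k)}


def dice(a, b):
    """Set-based Dice coefficient: 2|A ∩ B| / (|A| + |B|)."""
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


def classify(doc, class_profiles, n, top_k=300):
    """Return the class label whose profile is most similar to `doc`."""
    doc_profile = profile(doc, n, top_k)
    return max(
        class_profiles,
        key=lambda label: dice(doc_profile, class_profiles[label]),
    )


# Toy training data (hypothetical placeholder text, not Sorani news).
class_profiles = {
    "sport": profile("football match goal team", 5),
    "economy": profile("market bank price trade", 5),
}
predicted = classify("the team scored a goal", class_profiles, 5)
```

Python strings are sequences of Unicode code points, so the same slicing should apply to Sorani text in Arabic script. Whether to normalize characters before building profiles is a separate preprocessing decision that this sketch leaves out.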