TURKISH TEXT CATEGORIZATION USING N-GRAM WORDS

An N-gram is a representation that consists of a sequence of N contiguous characters or words. Many studies have used N-gram based representations for traditional text classification tasks. Compared with other languages, however, such studies for Turkish are limited. In this paper, we analyze text classification algorithms on a Turkish dataset using word N-grams. We compare several classifiers (Bayesian probabilistic classifiers, nearest neighbor classifiers, and decision trees) using different types of features. We applied the classifiers to data sets represented with unigram, bigram, and trigram words. The experiments used a total of 600 text documents assigned to six categories, and the best success rate, 95.83%, was achieved with unigrams.
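As an illustration of the word N-gram representation described above, the following minimal sketch extracts unigram, bigram, and trigram words from a sentence. The function name `word_ngrams`, the whitespace tokenization, and the sample sentence are assumptions made for this example; the paper's actual preprocessing is not specified here.

```python
from typing import List


def word_ngrams(text: str, n: int) -> List[str]:
    """Return the contiguous word n-grams of `text`, each joined by a space.

    Tokenization is a plain whitespace split (an assumption for this sketch).
    """
    tokens = text.split()
    # Slide a window of n tokens over the token list.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Hypothetical sample sentence, used only for illustration.
sample = "metin sınıflandırma için kelime dizileri"

unigrams = word_ngrams(sample, 1)
bigrams = word_ngrams(sample, 2)
trigrams = word_ngrams(sample, 3)

print(unigrams)
print(bigrams)
print(trigrams)
```

Each resulting N-gram can then serve as one feature (for example, as a term count) in a document vector passed to a classifier. A text with fewer than N words yields no N-grams.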
