Analysis of preprocessing methods on classification of Turkish texts

Preprocessing is a critical step in information retrieval and text mining. This study analyzes the effect of preprocessing methods on the classification of Turkish texts. We compiled two large datasets from Turkish newspapers using a crawler. On these two datasets and two additional datasets, we perform a detailed analysis of preprocessing methods such as stemming, stopword filtering, and word weighting, and we report the results of extensive experiments.
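To make the three preprocessing steps named above concrete, the following is a minimal sketch of how they might be chained before classification. It is not the authors' pipeline. The stopword list, the example sentences, and the names `STOPWORDS`, `preprocess`, `tfidf`, and `prefix_len` are all hypothetical, and fixed-length prefix truncation is used only as a crude stand-in for a real Turkish stemmer or morphological analyzer.

```python
import math
from collections import Counter

# Hypothetical, tiny stopword list for illustration only.
STOPWORDS = {"ve", "bir", "bu", "ile", "için"}


def preprocess(text, prefix_len=5):
    """Tokenize, filter stopwords, and truncate each token to a fixed prefix.

    Input is assumed to be lowercase already: Python's str.lower() does not
    apply Turkish casing rules (I -> ı), so casing is left out of this sketch.
    """
    tokens = [t.strip(".,;:!?\"'()") for t in text.split()]
    tokens = [t for t in tokens if t and t not in STOPWORDS]
    # Assumption: prefix truncation approximates stemming for a suffixing
    # language; a real stemmer would analyze the morphology instead.
    return [t[:prefix_len] for t in tokens]


def tfidf(docs):
    """Weight terms by raw term frequency times log(N / document frequency)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append(
            {term: count * math.log(n / df[term]) for term, count in tf.items()}
        )
    return weights


# Made-up example documents, not taken from the study's datasets.
docs = [
    "bu takım maçı kazandı",
    "takımlar ve maçlar",
    "ekonomi için yeni bir karar",
]
processed = [preprocess(d) for d in docs]
weights = tfidf(processed)
```

In this sketch, "takım" and "takımlar" collapse to the same term after truncation, so that term appears in two documents and receives a lower weight than a term confined to a single document. The resulting weight dictionaries could then be fed to any classifier as feature vectors.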
