A Knowledge-Poor Approach to Turkish Text Categorization

Document categorization is a way of determining a category for a given document. Supervised methods mostly rely on a training data and rich linguistic resources that are either language-specific or generic. This study proposes a knowledge-poor approach to text categorization without using any sets of rules or language specific resources such as part-of-speech tagger or shallow parser. Knowledge-poor here refers to lack of a reasonable amount of background knowledge. The proposed system architecture takes data as-is and simply separates tokens by space. Documents represented in vector space models are used as training data for many machine learning algorithm. We empirically examined and compared a several factors from similarity metrics to learning algorithms in a variety of experimental setups. Although researchers believe that some particular classifiers or metrics are better than others for text categorization, the recent studies disclose that the ranking of the models purely depends on the class, experimental setup and domain as well. The study features extensive evaluation, comparison within a variety of experiments. We evaluate models and similarity metrics for Turkish language as one of the agglutinative language especially within poor-knowledge framework. It is seen that output of the study would be very beneficial for other studies.

[1]  Yiming Yang,et al.  A Comparative Study on Feature Selection in Text Categorization , 1997, ICML.

[2]  Gerard Salton,et al.  A vector space model for automatic indexing , 1975, CACM.

[3]  Xue-wen Chen,et al.  FAST: a roc-based feature selection metric for small samples and imbalanced data classification problems , 2008, KDD.

[4]  Hiroshi Ogura,et al.  Comparison of metrics for feature selection in imbalanced text classification , 2011, Expert Syst. Appl..

[5]  Fabrizio Sebastiani,et al.  Supervised term weighting for automated text categorization , 2003, SAC '03.

[6]  Ian H. Witten,et al.  Data mining: practical machine learning tools and techniques with Java implementations , 2002, SGMD.

[7]  Chih-Jen Lin,et al.  LIBSVM: A library for support vector machines , 2011, TIST.

[8]  A Guran,et al.  TURKISH TEXT CATEGORIZATION USING N-GRAM WORD , 2009 .

[9]  Thorsten Joachims,et al.  Text Categorization with Support Vector Machines: Learning with Many Relevant Features , 1998, ECML.

[10]  Chu-Ren Huang,et al.  A Framework of Feature Selection Methods for Text Categorization , 2009, ACL.

[11]  George Forman,et al.  A pitfall and solution in multi-class feature selection for text classification , 2004, ICML.

[12]  Hinrich Schütze,et al.  Book Reviews: Foundations of Statistical Natural Language Processing , 1999, CL.

[13]  Meng Chang Chen,et al.  Using chi-square statistics to measure similarities for text categorization , 2011, Expert Syst. Appl..

[14]  Yiming Yang,et al.  RCV1: A New Benchmark Collection for Text Categorization Research , 2004, J. Mach. Learn. Res..

[15]  Hinrich Schütze,et al.  A comparison of classifiers and document representations for the routing problem , 1995, SIGIR '95.

[16]  Chris Buckley,et al.  OHSUMED: an interactive retrieval evaluation and new large test collection for research , 1994, SIGIR '94.

[17]  Elisabeth Métais,et al.  Natural language interfaces : what's the problem? -a data-driven quantitative analysis , 2010 .

[18]  George Forman,et al.  An Extensive Empirical Study of Feature Selection Metrics for Text Classification , 2003, J. Mach. Learn. Res..

[19]  Wei-Ying Ma,et al.  Improving text classification using local latent semantic indexing , 2004, Fourth IEEE International Conference on Data Mining (ICDM'04).

[20]  Fabrizio Sebastiani,et al.  Machine learning in automated text categorization , 2001, CSUR.

[21]  Yiming Yang,et al.  A re-examination of text categorization methods , 1999, SIGIR '99.

[22]  Banu Diri,et al.  Automatic Turkish Text Categorization in Terms of Author, Genre and Gender , 2006, NLDB.

[23]  Stan Matwin,et al.  Text Classification Using WordNet Hypernyms , 1998, WordNet@ACL/COLING.

[24]  Hinrich Schütze,et al.  Introduction to information retrieval , 2008 .

[25]  Jian Su,et al.  Supervised and Traditional Term Weighting Methods for Automatic Text Categorization , 2009, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[26]  Wongkot Sriurai,et al.  IMPROVING TEXT CATEGORIZATION BY USING A TOPIC MODEL , 2011 .

[27]  Ken Lang,et al.  NewsWeeder: Learning to Filter Netnews , 1995, ICML.