Text Categorization with Support Vector Machines. How to Represent Texts in Input Space?

The choice of the kernel function is crucial to most applications of support vector machines (SVMs). In this paper, however, we show that in text classification, term-frequency transformations have a larger impact on SVM performance than the kernel itself. We discuss the role of importance weights (e.g. document frequency and redundancy), which is not yet fully understood in light of model complexity and computational cost, and we show that time-consuming lemmatization or stemming can be avoided even when classifying a highly inflectional language such as German.
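To make the idea of a term-frequency transformation combined with an importance weight concrete, the following is a minimal sketch and not the paper's exact scheme. It assumes a log-damped term frequency, an inverse-document-frequency weight, and L2 normalization; the function name `tfidf_vectors` and the toy German sentences are hypothetical, and no stemming or lemmatization is applied to the tokens.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Map tokenized documents to L2-normalized log-TF x IDF vectors.

    Assumed weighting (illustrative only): log(1 + tf) * log(N / df).
    Each vector is returned as a dict from term to weight.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        # Damp raw counts with a logarithm, then scale by the importance weight.
        weights = {
            term: math.log(1 + freq) * math.log(n_docs / df[term])
            for term, freq in counts.items()
        }
        # Normalize to unit Euclidean length so document length does not dominate.
        norm = math.sqrt(sum(w * w for w in weights.values()))
        if norm > 0:
            weights = {term: w / norm for term, w in weights.items()}
        vectors.append(weights)
    return vectors


# Toy corpus of unstemmed, lowercased German tokens (hypothetical data).
docs = [
    "der hund jagt die katze".split(),
    "die katze jagt die maus".split(),
    "der kurs der aktie steigt".split(),
]
vectors = tfidf_vectors(docs)
```

In this sketch, a term that occurs in fewer documents (such as `hund`) receives a larger weight than an equally frequent term that occurs in more documents (such as `der`). The resulting sparse vectors could then be passed to any SVM implementation with the kernel of choice.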
