A Comparison of Support Vector Machines and Self-Organizing Maps for e-Mail Categorization

This paper reports on experiments in multi-class document categorization with support vector machines and self-organizing maps, using a data set of personal e-mail messages. Two distinct document representations are employed to characterize these messages: a standard word-based approach and a character n-gram representation. Based on these representations, the categorization performance of both machine learning approaches is assessed and compared.
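
To make the two document representations concrete, the following minimal Python sketch builds word-based and character n-gram frequency counts for a single message. The tokenization rule, the choice of n = 3, the function names, and the sample message are illustrative assumptions, not details taken from the paper.

```python
from collections import Counter
import re


def word_features(text):
    """Bag-of-words counts over lowercased alphanumeric tokens (assumed tokenization)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def char_ngram_features(text, n=3):
    """Counts of overlapping character n-grams after lowercasing and
    collapsing runs of whitespace (n=3 is an assumed, illustrative value)."""
    normalized = " ".join(text.lower().split())
    return Counter(normalized[i:i + n] for i in range(len(normalized) - n + 1))


# Hypothetical example message, not drawn from the paper's data set.
message = "Meeting moved to Monday"
words = word_features(message)
trigrams = char_ngram_features(message, n=3)

print(words.most_common(3))
print(trigrams.most_common(3))
```

Either set of counts could then be turned into a fixed-length vector and passed to a classifier such as a support vector machine or a self-organizing map; that step is omitted here because the paper's abstract does not specify it.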
