Support Vector Machines based on a semantic kernel for text categorization

We propose a new metric between documents for text categorization, based on a priori semantic knowledge about words. This metric can be incorporated into the radial basis function kernel of a Support Vector Machine (SVM) or used directly in a K-nearest neighbors (KNN) algorithm. Both SVM and KNN are tested and compared on the 20-newsgroups database. Support Vector Machines provide the best accuracy on test data.
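The abstract does not spell out the form of the metric, so the following is only a minimal sketch under stated assumptions. It assumes documents are bag-of-words vectors and that the semantic knowledge is available as a positive semi-definite word-similarity matrix `S`, giving a squared distance of the quadratic form (x - y)^T S (x - y). The names `semantic_sq_dist`, `rbf_kernel`, `knn_predict`, the projection `P`, and the toy vocabulary are all hypothetical and chosen for illustration; they are not taken from the paper.

```python
import numpy as np

def semantic_sq_dist(x, y, S):
    """Squared distance (x - y)^T S (x - y); S is an ASSUMED word-similarity matrix."""
    d = x - y
    return float(d @ S @ d)

def rbf_kernel(x, y, S, gamma=1.0):
    """Radial basis kernel built on the semantic distance instead of the Euclidean one."""
    return float(np.exp(-gamma * semantic_sq_dist(x, y, S)))

def knn_predict(query, docs, labels, S, k=1):
    """Majority vote among the k training documents closest to the query."""
    dists = np.array([semantic_sq_dist(query, d, S) for d in docs])
    nearest = np.argsort(dists)[:k]
    values, counts = np.unique(labels[nearest], return_counts=True)
    return int(values[np.argmax(counts)])

# Toy vocabulary (hypothetical): ["car", "automobile", "banana"].
# P maps each word to a concept; "car" and "automobile" share one concept.
P = np.array([[1.0, 0.0],
              [1.0, 0.0],
              [0.0, 1.0]])
S = P @ P.T  # positive semi-definite by construction

car = np.array([1.0, 0.0, 0.0])
automobile = np.array([0.0, 1.0, 0.0])
banana = np.array([0.0, 0.0, 1.0])

train_docs = np.array([car, banana])
train_labels = np.array([0, 1])  # 0 = vehicles, 1 = fruit

k_syn = rbf_kernel(car, automobile, S)   # synonyms: distance 0, kernel value 1
k_diff = rbf_kernel(car, banana, S)      # unrelated words: smaller kernel value
pred = knn_predict(automobile, train_docs, train_labels, S, k=1)
```

Under the plain Euclidean metric, "automobile" is equally far from "car" and "banana"; with the assumed semantic matrix it coincides with "car", so both the kernel value and the nearest-neighbor vote reflect the synonymy. A kernel function of this kind could in principle be supplied to an SVM implementation that accepts precomputed or user-defined kernels.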
