Support Vector Machines for Text Categorization Based on Latent Semantic Indexing

Text categorization (TC) is an important component of many information organization and information management tasks. Two key issues in TC are feature coding and classifier design. This paper describes an approach to text categorization that applies Support Vector Machines (SVMs) to features obtained by Latent Semantic Indexing (LSI). Latent Semantic Indexing [1][2] is a method for selecting informative subspaces of a feature space, with the goal of obtaining a compact representation of each document. Support Vector Machines [3] are powerful machine learning systems that combine remarkable performance with an elegant theoretical framework. SVMs fit the text categorization task well because of the special properties of text itself. Experiments show that the LSI+SVMs framework improves clustering performance by focusing the attention of the Support Vector Machines on informative subspaces of the feature space.
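The two stages described above (project documents onto a low-dimensional LSI subspace, then train an SVM on the projected features) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the six-document toy corpus, the use of raw term counts, the choice of k = 2, the unscaled projection onto the top right singular vectors, and the simple full-batch subgradient trainer standing in for a proper SVM solver. The helper names (`to_counts`, `fit_lsi`, `train_linear_svm`, `predict`) are hypothetical.

```python
import numpy as np

# Toy corpus (an assumption for illustration; not data from the paper).
docs = [
    "ball team goal", "team goal goal", "ball ball team",           # sports
    "stock market bank", "market bank bank", "stock stock market",  # finance
]
labels = np.array([1, 1, 1, -1, -1, -1])

vocab = sorted({w for d in docs for w in d.split()})
index = {w: i for i, w in enumerate(vocab)}

def to_counts(texts):
    """Build a documents-by-terms matrix of raw term counts."""
    X = np.zeros((len(texts), len(vocab)))
    for row, text in enumerate(texts):
        for word in text.split():
            if word in index:          # ignore out-of-vocabulary words
                X[row, index[word]] += 1
    return X

def fit_lsi(X, k):
    """Return the top-k right singular vectors (a term-space basis) of X."""
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    return vt[:k]

def train_linear_svm(Z, y, lam=0.01, lr=0.1, epochs=200):
    """Minimise L2-regularised hinge loss by full-batch subgradient descent."""
    w, b = np.zeros(Z.shape[1]), 0.0
    for _ in range(epochs):
        margins = y * (Z @ w + b)
        viol = margins < 1             # examples with functional margin below 1
        grad_w = lam * w - (y[viol][:, None] * Z[viol]).sum(axis=0) / len(y)
        grad_b = -y[viol].sum() / len(y)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

X = to_counts(docs)
basis = fit_lsi(X, k=2)                # LSI step: keep a 2-dimensional subspace
Z = X @ basis.T                        # documents expressed in the latent space
w, b = train_linear_svm(Z, labels)     # SVM step: train on the reduced features

def predict(texts):
    """Classify texts as +1 (sports) or -1 (finance) in the latent space."""
    return np.sign(to_counts(texts) @ basis.T @ w + b).astype(int)

train_pred = predict(docs)
new_pred = predict(["goal ball", "bank stock"])
```

In this sketch the classifier only ever sees the k-dimensional coordinates `Z`, not the full vocabulary-sized count vectors, which is the sense in which LSI focuses the SVM on a subspace. A real system would more likely use weighted term frequencies, a sparse truncated SVD, and a dedicated SVM solver.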

[1]  Thorsten Joachims,et al.  Text Categorization with Support Vector Machines: Learning with Many Relevant Features , 1998, ECML.

[2]  Norbert Fuhr,et al.  AIR/X - A rule-based multistage indexing system for large subject fields , 1991, RIAO.

[3]  Manfred K. Warmuth,et al.  The perceptron algorithm vs. Winnow: linear vs. logarithmic mistake bounds when few input variables are relevant , 1995, COLT '95.

[4]  Yiming Yang,et al.  An Evaluation of Statistical Approaches to Text Categorization , 1999, Information Retrieval.

[5]  Hwee Tou Ng,et al.  Feature selection, perceptron learning, and a usability case study for text categorization , 1997, SIGIR '97.

[6]  Christopher G. Atkeson,et al.  Extreme dimensionality reduction for text learning: cluster-generated feature spaces , 2000 .

[7]  David E. Johnson,et al.  Maximizing Text-Mining Performance , 1999 .

[8]  Sholom M. Weiss,et al.  Automated learning of decision rules for text categorization , 1994, TOIS.

[9]  Michael W. Berry,et al.  SVDPACKC (Version 1.0) User's Guide , 1993 .

[10]  Maria Simi,et al.  Experiments on the Use of Feature Selection and Negative Evidence in Automated Text Categorization , 2000, ECDL.

[11]  Michael W. Berry,et al.  Large-Scale Sparse Singular Value Computations , 1992 .

[12]  Jerome R. Bellegarda,et al.  A multispan language modeling framework for large vocabulary speech recognition , 1998, IEEE Trans. Speech Audio Process.

[13]  Yoram Singer,et al.  Context-sensitive learning methods for text categorization , 1996, SIGIR '96.

[14]  Yiming Yang,et al.  An example-based mapping method for text categorization and retrieval , 1994, TOIS.

[15]  Hinrich Schütze,et al.  A comparison of classifiers and document representations for the routing problem , 1995, SIGIR '95.