Paper information - Supervised latent semantic indexing for document categorization

Supervised latent semantic indexing for document categorization

Latent semantic indexing (LSI) is a successful technique in information retrieval (IR) that attempts to uncover the latent semantics implied by a query or a document by representing them in a dimension-reduced space. However, LSI is not optimal for document categorization because it seeks the most representative features for document representation rather than the most discriminative ones. In this paper, we propose supervised LSI (SLSI), which iteratively selects the most discriminative basis vectors using the training data. The extracted vectors are then used to project the documents into a reduced-dimensional space for better classification. Experimental evaluations show that the SLSI approach leads to dramatic dimension reduction while achieving good classification results.
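The contrast between representative and discriminative basis vectors can be illustrated with a small sketch. The abstract does not spell out the selection procedure, so the code below is only one plausible reading and not the paper's algorithm: candidate basis vectors come from a plain SVD of a term-by-document matrix, and a Fisher-style ratio (between-class over within-class variance of the projected documents) is assumed as the discriminative criterion. The function names `lsi_basis`, `fisher_score`, and `select_basis`, and the toy matrix, are hypothetical.

```python
import numpy as np

def lsi_basis(X, n_candidates):
    """Top left singular vectors of a term-by-document matrix X (plain LSI)."""
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    return U[:, :n_candidates]

def fisher_score(z, y):
    """Between-class over within-class variance of 1-D projections z.

    Assumed criterion for "discriminative"; the abstract does not name one.
    """
    overall = z.mean()
    between, within = 0.0, 0.0
    for c in np.unique(y):
        zc = z[y == c]
        between += zc.size * (zc.mean() - overall) ** 2
        within += ((zc - zc.mean()) ** 2).sum()
    return between / (within + 1e-12)

def select_basis(X, y, n_candidates, n_keep):
    """Greedily pick, one per iteration, the candidate with the best score."""
    U = lsi_basis(X, n_candidates)
    remaining = list(range(U.shape[1]))
    chosen = []
    while remaining and len(chosen) < n_keep:
        best = max(remaining, key=lambda i: fisher_score(U[:, i] @ X, y))
        chosen.append(best)
        remaining.remove(best)
    return U[:, chosen], chosen

# Toy data: term 0 is frequent but class-neutral, term 1 tracks the class.
X = np.array([
    [10, 0, 10, 0, 10, 0, 10, 0],
    [1, 1, 1, 1, 3, 3, 3, 3],
], dtype=float)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

basis, chosen = select_basis(X, y, n_candidates=2, n_keep=1)
reduced = basis.T @ X  # documents projected into the 1-D selected space
print("selected candidate indices:", chosen)
```

In this toy matrix the leading singular vector is dominated by the high-variance, class-neutral term, so plain LSI truncated to one dimension would keep a direction that barely separates the classes, whereas the supervised selection keeps the second vector, which loads mostly on the class-indicative term.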
