Paper Information - Fast Extraction of Semantic Features from a Latent Semantic Indexed Text Corpus

Fast Extraction of Semantic Features from a Latent Semantic Indexed Text Corpus

This paper proposes a projection-based symmetrical factorisation method for extracting semantic features from collections of text documents stored in a Latent Semantic space. Preliminary experimental results demonstrate that this yields a representation comparable to that provided by a novel probabilistic approach, which reconsiders the entire indexing problem of text documents and works directly in the original high-dimensional vector-space representation of text. The projection index employed is derived here from the a priori constraints on the problem. The principal advantage of this approach is computational efficiency, which is obtained by exploiting Latent Semantic Indexing as a preprocessing stage. Simulation results on subsets of the 20-Newsgroups text corpus in various settings are provided.

Ata Kabán | Mark A. Girolami
