Paper information - Evaluation of LSA performance in Spanish using multiple corpus of text

Latent Semantic Analysis (LSA) is a natural language processing tool that estimates the semantic distance between terms. The success of LSA depends mainly on the choice of training corpus, which has been studied principally in English. This study examines LSA trained on regional Spanish corpora and evaluates its performance on a synonym identification task. We found that performance was slightly better than chance, consistent with previous results. The standard LSA method cannot dynamically increase the training corpus. By using classifiers, we combined multiple LSA models and showed that automatic classifiers increase performance.

Mariano Sigman | Guillermo A. Cecchi | Facundo Carrillo | Diego Fernández Slezak

[1] T. Landauer, et al. Indexing by Latent Semantic Analysis, 1990.

[2] Peter W. Foltz, et al. An introduction to latent semantic analysis, 1998.

[3] Beatrice Santorini, et al. Building a Large Annotated Corpus of English: The Penn Treebank, 1993, CL.

[4] T. Landauer, et al. A Solution to Plato's Problem: The Latent Semantic Analysis Theory of Acquisition, Induction, and Representation of Knowledge, 1997.

[5] Ian H. Witten, et al. The WEKA data mining software: an update, 2009, SKDD.