Syntax versus Semantics: Analysis of Enriched Vector Space Models

This paper presents a robust method for the construction of collection-specific document models. These document models are variants of the well-known vector space model, which relies on a process of selecting, modifying, and weighting index terms with respect to a given document collection. We improve the step of index term selection by applying statistical methods for concept identification. This approach is particularly suited to post-retrieval categorization and to retrieval tasks in closed collections, as is typical of intranet search. We compare our approach to “enriched” vector-space-based document models that employ knowledge of the underlying language in the form of external semantic concepts. Our primary objective is to quantify the impact of a purely syntactic analysis, in contrast to a semantic enrichment, in the index construction step. As a by-product, we provide an efficient and language-independent means of vector space model construction, and the resulting document models perform better than the standard vector space model.
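The select, modify, and weight pipeline behind a standard vector space model can be sketched as follows. This is a minimal illustration only, assuming plain tf-idf weighting and a tiny hand-picked stopword list; it is not the statistical concept identification method the paper proposes, and the names `index_terms`, `build_vectors`, `cosine`, and `STOPWORDS` are hypothetical.

```python
import math
import re
from collections import Counter

# Assumption: a tiny illustrative stopword list, not taken from the paper.
STOPWORDS = {"the", "a", "of", "and", "in", "is", "to"}


def index_terms(text):
    """Select and modify index terms: lowercase, tokenize, drop stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]


def build_vectors(docs):
    """Weight each term by tf-idf with respect to the given collection."""
    term_lists = [index_terms(d) for d in docs]
    n = len(docs)
    # Document frequency: number of documents in which a term occurs.
    df = Counter(t for terms in term_lists for t in set(terms))
    vectors = []
    for terms in term_lists:
        tf = Counter(terms)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


docs = [
    "Intranet search in closed document collections",
    "Search and retrieval in closed collections",
    "WordNet is a lexical database of English",
]
vecs = build_vectors(docs)
```

Because the weights depend on document frequencies within `docs`, the resulting vectors are specific to that collection: the first two documents share weighted terms and score a positive similarity, while the third shares none with the first.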

Benno Stein | Martin Potthast | Sven Meyer zu Eissen
