Syntax versus Semantics: Analysis of Enriched Vector Space Models

This paper presents a robust method for the construction of collection-specific document models. These document models are variants of the well-known vector space model, which relies on a process of selecting, modifying, and weighting index terms with respect to a given document collection. We improve the index term selection step by applying statistical methods for concept identification. This approach is particularly suited to post-retrieval categorization and to retrieval tasks in closed collections, both of which are typical of intranet search. We compare our approach to “enriched” vector-space-based document models that employ knowledge of the underlying language in the form of external semantic concepts. Our primary objective is to quantify the impact of a purely syntactic analysis in contrast to a semantic enrichment in the index construction step. As a by-product, we provide an efficient and language-independent means of constructing vector space models, and the resulting document models perform better than the standard vector space model.
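For orientation, the standard vector space model that serves as the baseline can be sketched as follows. This is a minimal illustration of the generic select, modify, and weight pipeline, not the method proposed in the paper: the stopword list, the crude plural stripping (a stand-in for real stemming), the tf-idf variant, and all function names are assumptions made for this example.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list (assumption for this sketch).
STOPWORDS = {"the", "a", "of", "and", "in", "is", "to"}


def index_terms(text):
    """Select and modify index terms for one document."""
    tokens = re.findall(r"[a-z]+", text.lower())
    terms = []
    for token in tokens:
        if token in STOPWORDS:  # selection: discard stopwords
            continue
        # modification: naive plural stripping, a stand-in for stemming
        if token.endswith("s") and len(token) > 3:
            token = token[:-1]
        terms.append(token)
    return terms


def build_vsm(docs):
    """Weight index terms with tf-idf relative to the given collection."""
    term_lists = [index_terms(d) for d in docs]
    n_docs = len(docs)
    # document frequency: number of documents containing each term
    df = Counter(t for terms in term_lists for t in set(terms))
    vectors = []
    for terms in term_lists:
        tf = Counter(terms)
        # one common tf-idf variant: tf * log(N / df)
        vectors.append({t: f * math.log(n_docs / df[t]) for t, f in tf.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
```

Because the weights depend on document frequencies within the collection, the resulting vectors are collection-specific: the same document indexed within a different collection generally receives different term weights.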
