A Quantitative Evaluation of the Enhanced Topic-based Vector Space Model

This contribution presents a quantitative evaluation procedure for Information Retrieval models and the results of applying this procedure to the enhanced Topic-based Vector Space Model (eTVSM). Since the eTVSM is an ontology-based model, its effectiveness depends heavily on the quality of the underlying ontology. We therefore tested the model with different ontologies to evaluate their impact on its effectiveness. At the highest level of abstraction, we observed the following results. First, we confirmed the theoretically derived statement that the eTVSM is about as effective as the classic Vector Space Model when a trivial ontology is used (every term is a concept, and each concept is independent of all other concepts). Second, we were able to show that the effectiveness of the eTVSM increases when the ontology resolves only synonyms; we derived such an ontology automatically from WordNet. Third, we observed that more powerful ontologies derived automatically from WordNet dramatically reduced the effectiveness of the eTVSM, to a level clearly below that of the Vector Space Model. Fourth, we were able to show that a manually created and optimized ontology raises the effectiveness of the eTVSM to a level clearly above the best effectiveness levels we found in the literature for the Latent Semantic Index model on comparable document sets.
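The first two observations can be illustrated with a minimal sketch. It is not the eTVSM formulation itself: it assumes a simplified similarity in which a term-similarity matrix `S` stands in for the ontology, and the vocabulary, vectors, and function names are hypothetical. With the identity matrix as a trivial ontology, the measure coincides with the cosine similarity of the classic Vector Space Model; with a synonym-only matrix, a query that uses a synonym of a document term is no longer scored as unrelated.

```python
import math

def bilinear(u, v, S):
    """Return u^T S v for plain Python lists."""
    return sum(u[i] * S[i][j] * v[j]
               for i in range(len(u)) for j in range(len(v)))

def similarity(d, q, S):
    """Illustrative ontology-aware similarity (an assumption, not the eTVSM definition)."""
    denom = math.sqrt(bilinear(d, d, S) * bilinear(q, q, S))
    return bilinear(d, q, S) / denom if denom else 0.0

def cosine(d, q):
    """Classic Vector Space Model cosine similarity."""
    dot = sum(a * b for a, b in zip(d, q))
    norm = math.sqrt(sum(a * a for a in d)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0

# Hypothetical vocabulary: ["car", "automobile", "engine"]
doc = [1.0, 0.0, 1.0]     # document mentions "car" and "engine"
query = [0.0, 1.0, 0.0]   # query mentions "automobile"

# Trivial ontology: every term is its own, independent concept.
trivial = [[1.0, 0.0, 0.0],
           [0.0, 1.0, 0.0],
           [0.0, 0.0, 1.0]]

# Synonym-only ontology: "car" and "automobile" map to the same concept.
synonym = [[1.0, 1.0, 0.0],
           [1.0, 1.0, 0.0],
           [0.0, 0.0, 1.0]]

sim_trivial = similarity(doc, query, trivial)
sim_synonym = similarity(doc, query, synonym)
sim_cosine = cosine(doc, query)
```

In this toy setting `sim_trivial` equals `sim_cosine` (both are zero, since the document and query share no term), while `sim_synonym` is positive because the synonym relation links "car" and "automobile".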
