Metric Indexing for the Vector Model in Text Retrieval

In text retrieval, query processing in the vector model has been shown to be qualitatively more effective than searching in the Boolean model. However, for the classic vector model, current methods of processing many-term queries are inefficient, and for the LSI (latent semantic indexing) model no efficient method exists for processing even few-term queries. In this paper we propose a method of vector query processing based on metric indexing, which is efficient especially for the LSI model. In addition, we propose the concept of approximate semi-metric search, which can further improve the efficiency of the retrieval process. Results of experiments on a moderately sized text collection are included.
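To make the idea of metric indexing for document vectors concrete, the following is a minimal sketch and not the method of the paper. It assumes that cosine similarity is turned into a distance by taking the angle between vectors (which satisfies the triangle inequality), and it contrasts this with the dissimilarity 1 - cos, which is only a semi-metric because it can violate the triangle inequality. The function names, the single-pivot filter, and the toy vectors are all hypothetical choices made for illustration.

```python
import math

def cosine_similarity(u, v):
    # Standard cosine measure of the vector model.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def angular_distance(u, v):
    # Assumed metric: the angle between the vectors, in radians.
    # Clamping guards against rounding slightly outside [-1, 1].
    c = max(-1.0, min(1.0, cosine_similarity(u, v)))
    return math.acos(c)

def cosine_dissimilarity(u, v):
    # A semi-metric: symmetric and non-negative, but the
    # triangle inequality does not hold in general.
    return 1.0 - cosine_similarity(u, v)

def range_query(query, docs, pivot, pivot_dists, radius):
    # Hypothetical single-pivot filter. By the triangle inequality,
    # |d(q, p) - d(p, x)| is a lower bound on d(q, x), so any document
    # whose bound exceeds the radius is skipped without computing d(q, x).
    d_qp = angular_distance(query, pivot)
    hits, computed = [], 0
    for i, doc in enumerate(docs):
        if abs(d_qp - pivot_dists[i]) > radius:
            continue  # pruned by the lower bound
        computed += 1
        if angular_distance(query, doc) <= radius:
            hits.append(i)
    return hits, computed

# Toy two-dimensional "document vectors" at 0, 45 and 90 degrees.
docs = [(1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
pivot = docs[0]
pivot_dists = [angular_distance(pivot, d) for d in docs]
hits, computed = range_query((1.0, 0.1), docs, pivot, pivot_dists, 0.2)
```

In this toy setting the filter evaluates the distance to only one of the three documents. Pruning of this kind is exact only when the triangle inequality holds; applying it to a semi-metric such as 1 - cos can discard relevant documents, so a search of that kind would be approximate.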
