Range queries in natural language dictionaries with recursive lists of clusters

We evaluate the performance of range queries in the Recursive List of Clusters (RLC) metric data structure when the metric spaces are natural language dictionaries under the Levenshtein distance. The study compares RLC with five other data structures (GNAT, H-Dsatl, LAESA, LC, and vp-trees) on six dictionaries (English, French, German, Italian, Portuguese, and Spanish), each characterised by the mean and the variance of its histogram of distances. The experimental results show that RLC performs well in all tested cases and, in some of them, outperforms all the other data structures. In addition, RLC is the only data structure whose performance remains good regardless of whether the space dimension is low or high and whether the query radius is small or large.
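To make the setting concrete, the following is a minimal sketch of a range query over a word list under the Levenshtein distance, together with the mean and variance of the pairwise distances used to characterise a dictionary. It is a linear-scan baseline, not an implementation of RLC or of any of the compared data structures. The function names (`levenshtein`, `range_query`, `distance_stats`) and the toy word list are assumptions made for illustration and do not come from the study.

```python
from itertools import combinations
from statistics import mean, pvariance


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, or match at cost 0
            ))
        previous = current
    return previous[-1]


def range_query(words, query, radius):
    """Linear scan (hypothetical baseline): words within `radius` of `query`."""
    return [w for w in words if levenshtein(w, query) <= radius]


def distance_stats(words):
    """Mean and population variance of all pairwise distances."""
    distances = [levenshtein(u, v) for u, v in combinations(words, 2)]
    return mean(distances), pvariance(distances)


# Toy word list, invented for this example only.
toy_dictionary = ["cat", "cart", "care", "dog", "dot"]
print(range_query(toy_dictionary, "car", 1))  # ['cat', 'cart', 'care']
print(distance_stats(toy_dictionary))
```

A metric index such as RLC aims to return the same result set as `range_query` while computing far fewer distances than this scan, which evaluates the distance once per dictionary word.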
