Dimensional Testing for Multi-step Similarity Search

In data mining applications such as subspace clustering or feature selection, changes to the underlying feature set can require the reconstruction of search indices to support fundamental data mining tasks. For such situations, multi-step search approaches have been proposed that can accommodate changes in the underlying similarity measure without the need to rebuild the index. In this paper, we present a heuristic multi-step search algorithm that utilizes a measure of intrinsic dimension, the generalized expansion dimension (GED), as the basis of its search termination condition. Compared to the current state-of-the-art method, experimental results show that our heuristic approach is able to obtain significant improvements in both the number of candidates and the running time, while losing very little in the accuracy of the query results.

[1]  Ira Assent,et al.  Approximation Techniques for Indexing the Earth Mover’s Distance in Multimedia Databases , 2006, 22nd International Conference on Data Engineering (ICDE'06).

[2]  Hans-Peter Kriegel,et al.  LOF: identifying density-based local outliers , 2000, SIGMOD '00.

[3]  Hisashi Kashima,et al.  Generalized Expansion Dimension , 2012, 2012 IEEE 12th International Conference on Data Mining Workshops.

[4]  Hanan Samet,et al.  Ranking in Spatial Databases , 1995, SSD.

[5]  Hans-Peter Kriegel,et al.  Optimal multi-step k-nearest neighbor search , 1998, SIGMOD '98.

[6]  VARUN CHANDOLA,et al.  Anomaly detection: A survey , 2009, CSUR.

[7]  David R. Karger,et al.  Finding nearest neighbors in growth-restricted metrics , 2002, STOC '02.

[8]  Yoshua Bengio,et al.  Gradient-based learning applied to document recognition , 1998, Proc. IEEE.

[9]  James Lee Hafner,et al.  Efficient Color Histogram Indexing for Quadratic Form Distance Functions , 1995, IEEE Trans. Pattern Anal. Mach. Intell..

[10]  Hans-Peter Kriegel,et al.  Subspace clustering , 2012, WIREs Data Mining Knowl. Discov..

[11]  Arnold W. M. Smeulders,et al.  The Amsterdam Library of Object Images , 2004, International Journal of Computer Vision.

[12]  Moni Naor,et al.  Optimal aggregation algorithms for middleware , 2001, PODS.

[13]  Dimitrios Gunopulos,et al.  Indexing Multidimensional Time-Series , 2004, The VLDB Journal.

[14]  Leonidas J. Guibas,et al.  A metric for distributions with applications to image databases , 1998, Sixth International Conference on Computer Vision (IEEE Cat. No.98CH36271).

[15]  Christos Faloutsos,et al.  Fast Nearest Neighbor Search in Medical Image Databases , 1996, VLDB.

[16]  John Riedl,et al.  Application of Dimensionality Reduction in Recommender System - A Case Study , 2000 .

[17]  Ron Kohavi,et al.  Wrappers for Feature Subset Selection , 1997, Artif. Intell..

[18]  R. Suganya,et al.  Data Mining Concepts and Techniques , 2010 .

[19]  Alexandr Andoni,et al.  Corrigendum to "efficient similarity search and classification via rank aggregation" by Ronald Fagin, Ravi Kumar and D. Sivakumar (proc. SIGMOD'03) , 2008, SIGMOD Conference.

[20]  Sang-goo Lee,et al.  An effective document clustering method using user-adaptable distance metrics , 2002, SAC '02.

[21]  Hans-Peter Kriegel,et al.  Efficient User-Adaptable Similarity Search in Large Multimedia Databases , 1997, VLDB.