Easing the Dimensionality Curse by Stretching Metric Spaces

Queries over sets of complex elements are performed by extracting features from each element and using those features in place of the original elements during processing. Extracting a large number of significant features increases the representative power of the feature vector and improves query precision. However, each feature is a dimension in the representation space, so handling more features worsens the dimensionality curse. The problem arises because elements tend to distribute all over the space, and a large dimensionality allows them to spread over much broader spaces. Therefore, in high-dimensional spaces, elements are frequently farther from each other, and the differences in distance among pairs of elements tend to homogenize. When searching for nearest neighbors, the first one is usually not close, but once it is found, small increases in the query radius tend to include several others. This effect increases the overlap between nodes in access methods indexing the dataset. Both spatial and metric access methods are sensitive to the problem. This paper presents a general strategy, applicable to any metric access method, that improves the performance of similarity queries in high-dimensional spaces. Our technique applies a function that "stretches" the distances, so that close objects become closer and far ones become even farther. Experiments using the metric access method Slim-tree show that similarity queries performed in the transformed spaces require up to 70% fewer distance calculations, 52% fewer disk accesses, and up to 57% less total time compared with the original spaces.
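The homogenization of distances described above can be illustrated with a minimal sketch. It assumes uniformly random points in the unit hypercube and the Euclidean distance, neither of which comes from the paper. The function `stretch` and its parameter `base` are hypothetical: the abstract does not state the stretching function actually used, so an exponential weighting is chosen here only because it is monotone and widens the ratio between far and near distances. The sketch does not check whether the triangle inequality, which metric access methods rely on, survives the transformation.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def query_distances(dim, n_points=200, seed=0):
    """Distances from one random query to n_points random points in [0, 1]^dim."""
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    query = [rng.random() for _ in range(dim)]
    return [euclidean(query, p) for p in points]


def relative_contrast(dists):
    """(farthest - nearest) / nearest; small values mean homogenized distances."""
    d_min, d_max = min(dists), max(dists)
    return (d_max - d_min) / d_min


def stretch(d, base=2.0):
    """Hypothetical stretching function (an assumption, not the paper's):
    d * base**d is monotone for base > 1 and amplifies larger distances more."""
    return d * base ** d


if __name__ == "__main__":
    for dim in (2, 20, 200):
        dists = query_distances(dim)
        stretched = [stretch(d) for d in dists]
        print(dim, relative_contrast(dists), relative_contrast(stretched))
```

With these assumptions, the relative contrast of the original distances shrinks as the dimensionality grows, while the stretched distances show a larger contrast than the original ones in every dimensionality, since the ratio between any two stretched distances exceeds the ratio between the original ones.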
