A General Cost Model for Dimensionality Reduction in High Dimensional Spaces

Similarity search usually encounters a serious problem in the high dimensional space, known as the "curse of dimensionality". In order to speed up the retrieval efficiency, previous approaches usually reduce the dimensionality of the entire data set to a fixed lower value before building indexes (referred to as global dimensionality reduction (GDR)). More recent works focus on locally reducing the dimensionality of data to different values (called the local dimensionality reduction (LDR)). However, so far little work has formally evaluated the effectiveness and efficiency of both GDR and LDR for range queries. Motivated by this, in this paper, we propose a general cost model for both GDR and LDR, in light of which we introduce a novel LDR method, PRANS. It can achieve high retrieval efficiency with the guarantee of optimality given by the formal model. Finally, a B+-tree index is constructed over the reduced partitions for fast similarity search. Extensive experiments validate the correctness of our cost model on both real and synthetic data sets, and demonstrate the efficiency and effectiveness of the proposed PRANS method.

[1]  Renée J. Miller,et al.  Similarity search over time-series data using wavelets , 2002, Proceedings 18th International Conference on Data Engineering.

[2]  Sharad Mehrotra,et al.  Local Dimensionality Reduction: A New Approach to Indexing High Dimensional Spaces , 2000, VLDB.

[3]  Jiawei Han,et al.  Efficient and Effective Clustering Methods for Spatial Data Mining , 1994, VLDB.

[4]  Pavel Zezula,et al.  M-tree: An Efficient Access Method for Similarity Search in Metric Spaces , 1997, VLDB.

[5]  Ron Kohavi,et al.  Feature Subset Selection Using the Wrapper Method: Overfitting and Dynamic Search Space Topology , 1995, KDD.

[6]  Sharad Mehrotra,et al.  The hybrid tree: an index structure for high dimensional feature spaces , 1999, Proceedings 15th International Conference on Data Engineering (Cat. No.99CB36337).

[7]  Beng Chin Ooi,et al.  iDistance: An adaptive B+-tree based indexing method for nearest neighbor search , 2005, TODS.

[8]  C. C. Heyde,et al.  Central Limit Theorem , 2006 .

[9]  Beng Chin Ooi,et al.  An adaptive and efficient dimensionality reduction algorithm for high-dimensional indexing , 2003, Proceedings 19th International Conference on Data Engineering (Cat. No.03CH37405).

[10]  Raymond T. Ng,et al.  Indexing spatio-temporal trajectories with Chebyshev polynomials , 2004, SIGMOD '04.

[11]  Antonin Guttman,et al.  R-trees: a dynamic index structure for spatial searching , 1984, SIGMOD '84.

[12]  Christos Faloutsos,et al.  Efficient Similarity Search In Sequence Databases , 1993, FODO.

[13]  Hans-Peter Kriegel,et al.  The X-tree : An Index Structure for High-Dimensional Data , 2001, VLDB.

[14]  Clu-istos Foutsos,et al.  Fast subsequence matching in time-series databases , 1994, SIGMOD '94.

[15]  Hans-Jörg Schek,et al.  A Quantitative Analysis and Performance Study for Similarity-Search Methods in High-Dimensional Spaces , 1998, VLDB.

[16]  Christos Faloutsos,et al.  Efficiently supporting ad hoc queries in large datasets of time sequences , 1997, SIGMOD '97.

[17]  Jörg Sander,et al.  A Trajectory Splitting Model for Efficient Spatio-Temporal Indexing , 2005, VLDB.

[18]  Christos Faloutsos,et al.  The TV-tree: An index structure for high-dimensional data , 1994, The VLDB Journal.

[19]  Christos Faloutsos,et al.  Fast Time Sequence Indexing for Arbitrary Lp Norms , 2000, VLDB.

[20]  Ambuj K. Singh,et al.  Dimensionality reduction for similarity searching in dynamic databases , 1998, SIGMOD '98.

[21]  Hans-Peter Kriegel,et al.  The pyramid-technique: towards breaking the curse of dimensionality , 1998, SIGMOD '98.

[22]  Jing Huang,et al.  Image indexing using color correlograms , 1997, Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[23]  Deng Cai,et al.  Tensor Subspace Analysis , 2005, NIPS.

[24]  Shin'ichi Satoh,et al.  The SR-tree: an index structure for high-dimensional nearest neighbor queries , 1997, SIGMOD '97.

[25]  Christos Faloutsos,et al.  Hilbert R-tree: An Improved R-tree using Fractals , 1994, VLDB.

[26]  Jonathan Goldstein,et al.  When Is ''Nearest Neighbor'' Meaningful? , 1999, ICDT.

[27]  Hans-Peter Kriegel,et al.  Optimal multi-step k-nearest neighbor search , 1998, SIGMOD '98.

[28]  Jiawei Han,et al.  CLARANS: A Method for Clustering Objects for Spatial Data Mining , 2002, IEEE Trans. Knowl. Data Eng..