CVA file: an index structure for high-dimensional datasets

Similarity search is important in information-retrieval applications, where objects are usually represented as high-dimensional vectors. This paper proposes a new dimensionality-reduction technique and an indexing mechanism for high-dimensional datasets. For each data vector, the proposed technique drops the dimensions whose coordinates are less than a critical value. This flexible, datawise dimensionality reduction improves indexing mechanisms for high-dimensional datasets whose distributions are skewed in all coordinates. To apply the proposed technique to information retrieval, a CVA file (compact VA file), which is a revised version of the VA file, is developed. Using a CVA file further reduces the size of the index files while keeping the index bounds as tight as possible. The effectiveness is confirmed on synthetic and real data.
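The datawise reduction described in the abstract can be illustrated with a small sketch. This is not the paper's implementation: the function names, the uniform grid quantization, the 4-bit cell width, and the assumption that every coordinate lies in [0, 1] are all hypothetical choices made for illustration. The sketch keeps only the coordinates of a vector that reach the critical value, stores them as (dimension, cell) pairs in the spirit of a VA-file approximation, and computes a lower bound on the squared Euclidean distance by treating each dropped coordinate as lying somewhere in [0, critical).

```python
from typing import List, Tuple


def quantize(value: float, bits: int) -> int:
    """Map a value in [0, 1] to one of 2**bits uniform cells (assumed grid)."""
    cells = 1 << bits
    return min(max(int(value * cells), 0), cells - 1)


def compact_approximation(
    vec: List[float], critical: float, bits: int = 4
) -> List[Tuple[int, int]]:
    """Keep only coordinates >= critical, stored as (dimension, cell) pairs."""
    return [(i, quantize(x, bits)) for i, x in enumerate(vec) if x >= critical]


def lower_bound_sq(
    query: List[float],
    approx: List[Tuple[int, int]],
    critical: float,
    bits: int = 4,
) -> float:
    """Lower bound on the squared Euclidean distance from query to the vector.

    Kept dimensions are bounded by their grid cell; dropped dimensions are
    only known to lie in [0, critical) under the non-negativity assumption.
    """
    cells = 1 << bits
    kept = dict(approx)
    total = 0.0
    for i, q in enumerate(query):
        if i in kept:
            lo, hi = kept[i] / cells, (kept[i] + 1) / cells
        else:
            lo, hi = 0.0, critical
        gap = max(lo - q, q - hi, 0.0)  # zero when q falls inside [lo, hi]
        total += gap * gap
    return total


vec = [0.9, 0.01, 0.0, 0.4, 0.02]
query = [0.9, 0.5, 0.0, 0.0, 0.0]
approx = compact_approximation(vec, critical=0.05)
bound = lower_bound_sq(query, approx, critical=0.05)
exact = sum((a - b) ** 2 for a, b in zip(vec, query))
```

In this example only two of the five coordinates are stored, and the bound stays below the exact squared distance. The trade-off is in the choice of the critical value: a larger value drops more coordinates and shrinks the index, but widens the interval assumed for each dropped coordinate and so loosens the bound.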

Jiyuan An | Hanxiong Chen | Nobuo Ohbo | Kazutaka Furuse
