High-Dimensional Nearest Neighbor Search with Remote Data Centers

Abstract. Many data centers have archived tremendous amounts of data and begun to publish them on the Web. Because of limited resources and the large number of service requests, data centers usually do not directly support high-cost queries. Users, on the other hand, are often overwhelmed by the huge data volume and cannot afford to download whole data sets and search them locally. To support high-dimensional nearest neighbor searches in this environment, the paper develops a multi-level approximation scheme. The coarsest-level approximations are stored locally and searched first. The result is then refined gradually via accesses to remote data centers. Data centers need only deliver data items, or their precomputed finer-level approximations, by identifier. The search process is usually long in this environment, since it involves remote sites. The paper therefore describes an online search process: the system periodically reports a data item and a positive integer M. The reported item is guaranteed to be one of the M nearest neighbors of the query item. The paper proposes two algorithms to minimize M in each period. Experiments show that one of them performs similarly to a theoretical a posteriori algorithm and significantly outperforms the online extensions of two state-of-the-art nearest neighbor search methods.
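The online guarantee described in the abstract can be illustrated with a minimal sketch. This is not either of the paper's two algorithms; it is a simple greedy scheme under stated assumptions. It assumes that the locally stored coarse approximation of each item is a prefix of its coordinates (so the prefix distance is a lower bound on the full Euclidean distance), that there is a single approximation level, and that one remote fetch is made per reporting period. The names `prefix_lower_bound`, `online_nn_search`, and `fetch_remote` are hypothetical.

```python
import math


def prefix_lower_bound(query, prefix):
    # Distance over the first len(prefix) coordinates only. Dropping the
    # remaining squared terms can only shrink the sum, so this never exceeds
    # the full Euclidean distance.
    return math.sqrt(sum((q - p) ** 2 for q, p in zip(query, prefix)))


def online_nn_search(query, local_prefixes, fetch_remote):
    """Yield (item_id, M) once per period (assumed: one remote fetch per period).

    The reported item is guaranteed to be among the M nearest neighbors of
    the query, because at most M - 1 items can still be strictly closer.
    """
    lower = {i: prefix_lower_bound(query, p) for i, p in local_prefixes.items()}
    pending = sorted(lower, key=lower.get)  # most promising candidates first
    best_id, best_dist = None, math.inf
    while pending:
        item_id = pending.pop(0)
        dist = math.dist(query, fetch_remote(item_id))  # exact distance
        if dist < best_dist:
            best_id, best_dist = item_id, dist
        # Only unfetched items whose lower bound is below the best exact
        # distance can still beat the reported item.
        m = 1 + sum(1 for j in pending if lower[j] < best_dist)
        yield best_id, m
        if m == 1:  # the reported item is provably the nearest neighbor
            return


# Toy data: a dict stands in for the remote data center (an assumption).
remote = {
    "a": (0.0, 0.0, 5.0),
    "b": (1.0, 0.0, 0.0),
    "c": (3.0, 0.0, 0.0),
    "d": (0.5, 0.0, 0.2),
}
local_prefixes = {i: v[:1] for i, v in remote.items()}  # keep 1 coordinate locally
query = (0.0, 0.0, 0.0)

reports = list(online_nn_search(query, local_prefixes, remote.__getitem__))
print(reports)
```

In this toy run the first report is a poor candidate with a large M, and the second fetch already drives M down to 1. A scheme closer to the paper's would also request finer-level approximations rather than only full items, and would choose each access so as to minimize M.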
