Clindex: Clustering for Similarity Queries in High-Dimensional Spaces.

Clindex: Clustering for Similarity Queries in High-Dimensional Spaces Chen Li, Edward Chang, Hector Garcia-Molina James Ze Wang and Gio Wiederhold Department of Computer Science, Stanford University Abstract In this paper we present a clustering and indexing paradigm (called Clindex) for highdimensional search spaces. The scheme is designed for approximate searches, where one wishes to find many of the data points near a target point, but where one can tolerate missing a few near points. For such searches, our scheme can find near points with high recall in very few IOs and performs significantly better than other approaches. Our scheme is based on finding clusters, and then building a simple but efficient index for them. We analyze the tradeoffs involved in clustering and building such an index structure, and present experimental results based on a 30,000 image database. Keywords: similarity search, multidimensional indexes.

[1]  Hector Garcia-Molina,et al.  Copy detection mechanisms for digital documents , 1995, SIGMOD '95.

[2]  Thomas G. Dietterich What is machine learning? , 2020, Archives of Disease in Childhood.

[3]  Nick Roussopoulos,et al.  Nearest neighbor queries , 1995, SIGMOD '95.

[4]  James Ze Wang,et al.  Wavelet-based image indexing techniques with partial sketch retrieval capability , 1997, Proceedings of ADL '97 Forum on Research and Technology. Advances in Digital Libraries.

[5]  Jon M. Kleinberg,et al.  Two algorithms for nearest-neighbor search in high dimensions , 1997, STOC '97.

[6]  Leonidas J. Guibas,et al.  Adaptive Color-Image Embeddings for Database Navigation , 1998, ACCV.

[7]  Sudipto Guha,et al.  CURE: an efficient clustering algorithm for large databases , 1998, SIGMOD '98.

[8]  J. T. Robinson,et al.  The K-D-B-tree: a search structure for large multidimensional dynamic indexes , 1981, SIGMOD '81.

[9]  Shin'ichi Satoh,et al.  The SR-tree: an index structure for high-dimensional nearest neighbor queries , 1997, SIGMOD '97.

[10]  Dimitrios Gunopulos,et al.  Automatic subspace clustering of high dimensional data for data mining applications , 1998, SIGMOD '98.

[11]  Allen Gersho,et al.  Vector quantization and signal compression , 1991, The Kluwer international series in engineering and computer science.

[12]  R. Ng,et al.  Eecient and Eeective Clustering Methods for Spatial Data Mining , 1994 .

[13]  Kenneth L. Clarkson,et al.  An algorithm for approximate closest-point queries , 1994, SCG '94.

[14]  W. B. Johnson,et al.  Extensions of Lipschitz mappings into Hilbert space , 1984 .

[15]  Piotr Indyk,et al.  Similarity Search in High Dimensions via Hashing , 1999, VLDB.

[16]  Antonin Guttman,et al.  R-trees: a dynamic index structure for spatial searching , 1984, SIGMOD '84.

[17]  James Ze Wang,et al.  Content-based image indexing and searching using Daubechies' wavelets , 1998, International Journal on Digital Libraries.

[18]  Edward Y. Chang,et al.  RIME: a replicated image detector for the World Wide Web , 1998, Other Conferences.

[19]  Sunil Arya,et al.  An optimal algorithm for approximate nearest neighbor searching fixed dimensions , 1998, JACM.

[20]  Ramesh C. Jain,et al.  Similarity indexing: algorithms and performance , 1996, Electronic Imaging.

[21]  Santosh S. Vempala,et al.  Locality-preserving hashing in multidimensional spaces , 1997, STOC '97.

[22]  Hans-Peter Kriegel,et al.  The X-tree : An Index Structure for High-Dimensional Data , 2001, VLDB.

[23]  Dragutin Petkovic,et al.  Query by Image and Video Content: The QBIC System , 1995, Computer.

[24]  Hans-Peter Kriegel,et al.  A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise , 1996, KDD.

[25]  Ramesh C. Jain,et al.  Similarity indexing with the SS-tree , 1996, Proceedings of the Twelfth International Conference on Data Engineering.

[26]  Hans-Peter Kriegel,et al.  The R*-tree: an efficient and robust access method for points and rectangles , 1990, SIGMOD '90.

[27]  Ambuj K. Singh,et al.  Dimensionality reduction for similarity searching in dynamic databases , 1998, SIGMOD '98.

[28]  Hector Garcia-Molina,et al.  Safeguarding and charging for information on the Internet , 1998, Proceedings 14th International Conference on Data Engineering.

[29]  Jennifer Widom,et al.  Database System Implementation , 2000 .

[30]  Alberto Del Bimbo,et al.  Visual information retrieval , 1999 .

[31]  Piotr Indyk,et al.  Approximate nearest neighbors: towards removing the curse of dimensionality , 1998, STOC '98.

[32]  Pavel Zezula,et al.  M-tree: An Efficient Access Method for Similarity Search in Metric Spaces , 1997, VLDB.

[33]  Tian Zhang,et al.  BIRCH: an efficient data clustering method for very large databases , 1996, SIGMOD '96.