Indexing Images in High-Dimensional and Dynamic-Weighted Feature Spaces

As information retrieval systems evolve to deal with multimedia content, we see the dimensions of content feature space increasing, and relevance feedback being employed to provide more accurate query results. In this paper, we propose using Tree-structured Vector Quantization (TSVQ) to index high-dimensional data for supporting efficient similarity searches and effective relevance feedback. To support efficient similarity searches, we first use TSVQ to cluster data and store each cluster in a sequential file. We then model a similarity search as a classification problem—similar objects are much more likely to be found in the clusters into which the query object is classified. When relevance feedback is considered, and thereby features are weighted differently, we show that our approach remains very effective. Our empirical study on both a 51K and a one-million-image dataset shows that tackling indexing as a classification problem and solving the problem with TSVQ is efficient, effective, and scalable with respect to both data dimensions and dataset size.

[1]  Edward Y. Chang,et al.  RIME: a replicated image detector for the World Wide Web , 1998, Other Conferences.

[2]  Jennifer Widom,et al.  Database System Implementation , 2000 .

[3]  Hans-Peter Kriegel,et al.  The X-tree : An Index Structure for High-Dimensional Data , 2001, VLDB.

[4]  Jiawei Han,et al.  Efficient and Effective Clustering Methods for Spatial Data Mining , 1994, VLDB.

[5]  Kenneth L. Clarkson,et al.  An algorithm for approximate closest-point queries , 1994, SCG '94.

[6]  Piotr Indyk,et al.  Approximate nearest neighbors: towards removing the curse of dimensionality , 1998, STOC '98.

[7]  Hans-Peter Kriegel,et al.  A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise , 1996, KDD.

[8]  Dragutin Petkovic,et al.  Query by Image and Video Content: The QBIC System , 1995, Computer.

[9]  Pavel Zezula,et al.  M-tree: An Efficient Access Method for Similarity Search in Metric Spaces , 1997, VLDB.

[10]  Shin'ichi Satoh,et al.  The SR-tree: an index structure for high-dimensional nearest neighbor queries , 1997, SIGMOD '97.

[11]  Hector Garcia-Molina,et al.  Safeguarding and charging for information on the Internet , 1998, Proceedings 14th International Conference on Data Engineering.

[12]  Hanan Samet,et al.  Ranking in Spatial Databases , 1995, SSD.

[13]  Ramesh C. Jain,et al.  Similarity indexing with the SS-tree , 1996, Proceedings of the Twelfth International Conference on Data Engineering.

[14]  Hans-Peter Kriegel,et al.  The pyramid-technique: towards breaking the curse of dimensionality , 1998, SIGMOD '98.

[15]  Hector Garcia-Molina,et al.  Copy detection mechanisms for digital documents , 1995, SIGMOD '95.

[16]  Jon M. Kleinberg,et al.  Two algorithms for nearest-neighbor search in high dimensions , 1997, STOC '97.

[17]  Nick Roussopoulos,et al.  Nearest neighbor queries , 1995, SIGMOD '95.

[18]  Ron Kohavi,et al.  The Utility of Feature Weighting in Nearest-Neighbor Algorithms , 1997 .

[19]  E. Y. Chang,et al.  Toward perception-based image retrieval , 2000, 2000 Proceedings Workshop on Content-based Access of Image and Video Libraries.

[20]  Marco Patella,et al.  PAC nearest neighbor queries: Approximate and controlled search in high-dimensional and metric spaces , 2000, Proceedings of 16th International Conference on Data Engineering (Cat. No.00CB37073).

[21]  Aidong Zhang,et al.  WaveCluster: A Multi-Resolution Clustering Approach for Very Large Spatial Databases , 1998, VLDB.

[22]  Sunil Arya,et al.  An optimal algorithm for approximate nearest neighbor searching fixed dimensions , 1998, JACM.

[23]  Rafail Ostrovsky,et al.  Efficient search for approximate nearest neighbor in high dimensional spaces , 1998, STOC '98.

[24]  J. T. Robinson,et al.  The K-D-B-tree: a search structure for large multidimensional dynamic indexes , 1981, SIGMOD '81.

[25]  Antonin Guttman,et al.  R-trees: a dynamic index structure for spatial searching , 1984, SIGMOD '84.

[26]  Tian Zhang,et al.  BIRCH: an efficient data clustering method for very large databases , 1996, SIGMOD '96.

[27]  Hans-Peter Kriegel,et al.  The R*-tree: an efficient and robust access method for points and rectangles , 1990, SIGMOD '90.

[28]  Dimitrios Gunopulos,et al.  Automatic subspace clustering of high dimensional data for data mining applications , 1998, SIGMOD '98.

[29]  Piotr Indyk,et al.  Similarity Search in High Dimensions via Hashing , 1999, VLDB.

[30]  Ramesh C. Jain,et al.  Similarity indexing: algorithms and performance , 1996, Electronic Imaging.

[31]  Shih-Fu Chang,et al.  VisualSEEk: a fully automated content-based image query system , 1997, MULTIMEDIA '96.