CSVD: approximate similarity searches in high-dimensional spaces using clustering and singular value decomposition

Many data-intensive applications, such as content-based retrieval of images or video from multimedia databases and similarity retrieval of patterns in data mining, require the ability to perform similarity queries efficiently. Unfortunately, the performance of nearest neighbor (NN) algorithms, the basis for similarity search, deteriorates quickly as the number of dimensions grows. In this paper we propose a method called Clustering with Singular Value Decomposition (CSVD), which combines clustering and singular value decomposition (SVD) to reduce the number of index dimensions. With CSVD, points are grouped into clusters that are more amenable to dimensionality reduction than the original dataset. Experiments with texture vectors extracted from satellite images show that CSVD achieves significantly higher dimensionality reduction than SVD alone for the same fraction of total variance preserved. Conversely, for the same compression ratio, CSVD increases the preserved total variance relative to SVD (e.g., a 70% increase for a 20:1 compression ratio). As a result, approximate NN queries are processed more efficiently, as quantified through experimental results.
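
The idea of clustering first and then reducing each cluster separately can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not name the clustering algorithm, so plain k-means (Lloyd's iterations) is assumed here, and the function names `kmeans`, `dims_for_variance`, and `csvd_dims` are hypothetical. The sketch only counts how many dimensions are needed to preserve a given fraction of total variance, globally and per cluster.

```python
import numpy as np

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means (an assumed stand-in for the clustering step)."""
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        # Squared Euclidean distance from every point to every centroid.
        dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return labels, centroids

def dims_for_variance(points, fraction):
    """Smallest number of SVD dimensions preserving `fraction` of total variance."""
    points = np.asarray(points, dtype=float)
    centered = points - points.mean(axis=0)
    s = np.linalg.svd(centered, compute_uv=False)
    total = np.sum(s ** 2)
    if total == 0:
        return 0  # degenerate cluster: all points coincide
    energy = np.cumsum(s ** 2) / total
    # First index whose cumulative energy reaches the target, as a count.
    return int(min(np.searchsorted(energy, fraction) + 1, len(s)))

def csvd_dims(points, k, fraction, seed=0):
    """Cluster, then report the dimensions each non-empty cluster needs."""
    points = np.asarray(points, dtype=float)
    labels, _ = kmeans(points, k, seed=seed)
    return [dims_for_variance(points[labels == j], fraction)
            for j in range(k) if np.any(labels == j)]
```

Comparing `dims_for_variance(data, f)` on the whole dataset with `csvd_dims(data, k, f)` shows the effect the abstract describes: when the data is locally low-dimensional, each cluster can need fewer dimensions than the dataset as a whole for the same preserved variance fraction `f`.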
