FastMap: a fast algorithm for indexing, data-mining and visualization of traditional and multimedia datasets

A very promising idea for fast searching in traditional and multimedia databases is to map objects into points in k-d space, using k feature-extraction functions provided by a domain expert [25]. We can then use highly fine-tuned spatial access methods (SAMs) to answer several types of queries, including the 'Query By Example' type (which translates to a range query), the 'all pairs' query (which translates to a spatial join [8]), the nearest-neighbor or best-match query, etc.

However, designing feature-extraction functions can be hard. It is relatively easier for a domain expert to assess the similarity/distance of two objects. Given only the distance information, though, it is not obvious how to map objects into points.

This is exactly the topic of this paper. We describe a fast algorithm to map objects into points in some k-dimensional space (k is user-defined), such that the dissimilarities are preserved. There are two benefits from this mapping: (a) efficient retrieval, in conjunction with a SAM, as discussed before, and (b) visualization and data-mining: the objects can now be plotted as points in 2-d or 3-d space, revealing potential clusters, correlations among attributes and other regularities that data-mining is looking for.

We introduce an older method from pattern recognition, namely Multi-Dimensional Scaling (MDS) [51]; although it is unsuitable for indexing, we use it as a yardstick for our method. Then we propose a much faster algorithm to solve the problem at hand, which in addition allows for indexing. Experiments on real and synthetic data indeed show that the proposed algorithm is significantly faster than MDS (being linear, as opposed to quadratic, in the database size N), while it manages to preserve distances and the overall structure of the data-set.
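The abstract does not spell out the algorithm itself, so the following is only a minimal Python sketch of the projection scheme FastMap is commonly described as using: pick two far-apart pivot objects, place every object on the line through them via the cosine law, then repeat on the residual distances for the next coordinate. The names `fastmap`, `residual_sq` and `euclid`, the seeded random start, and the single farthest-point hop used to choose pivots are illustrative assumptions, not the authors' code.

```python
import math
import random


def fastmap(objects, dist, k, seed=0):
    """Map each object to a k-d point using only the distance function `dist`."""
    rng = random.Random(seed)  # assumption: seeded start for reproducibility
    n = len(objects)
    coords = [[0.0] * k for _ in range(n)]

    def residual_sq(i, j, axis):
        # Squared distance that remains after removing the contribution
        # of the coordinates already assigned (axes 0 .. axis-1).
        d2 = dist(objects[i], objects[j]) ** 2
        for c in range(axis):
            d2 -= (coords[i][c] - coords[j][c]) ** 2
        return max(d2, 0.0)  # clamp small negative values from rounding

    for axis in range(k):
        # Heuristic pivot choice: start anywhere, hop to the farthest
        # object, then hop once more to the object farthest from that one.
        start = rng.randrange(n)
        a = max(range(n), key=lambda j: residual_sq(start, j, axis))
        b = max(range(n), key=lambda j: residual_sq(a, j, axis))
        d_ab_sq = residual_sq(a, b, axis)
        if d_ab_sq == 0.0:
            break  # nothing left to separate; remaining coordinates stay 0
        d_ab = math.sqrt(d_ab_sq)
        for i in range(n):
            # Cosine law: position of object i along the line a -> b.
            coords[i][axis] = (
                residual_sq(a, i, axis) + d_ab_sq - residual_sq(b, i, axis)
            ) / (2.0 * d_ab)
    return coords


def euclid(p, q):
    # Stand-in for the expert-supplied distance; any distance function fits.
    return math.hypot(p[0] - q[0], p[1] - q[1])


pts = [(0, 0), (3, 0), (0, 4), (3, 4), (1, 1)]
coords = fastmap(pts, euclid, k=2)
```

In this sketch each axis needs a fixed number of passes over the N objects, which is consistent with the linear dependence on N that the abstract claims. With genuinely Euclidean input and k equal to its dimensionality, the mapped points reproduce the original pairwise distances up to rounding.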

[1]  H. V. Jagadish Spatial search with polyhedra , 1990, [1990] Proceedings. Sixth International Conference on Data Engineering.

[2]  Klaus H. Hinrichs,et al.  The Grid File: A Data Structure to Support Proximity Queries on Spatial Objects , 1983, International Workshop on Graph-Theoretic Concepts in Computer Science.

[3]  Hans-Peter Kriegel,et al.  The R*-tree: an efficient and robust access method for points and rectangles , 1990, SIGMOD '90.

[4]  Susan T. Dumais,et al.  Latent Semantic Indexing (LSI) and TREC-2 , 1993, TREC.

[5]  W. Torgerson Multidimensional scaling: I. Theory and method , 1952 .

[6]  Marvin B. Shapiro The choice of reference points in best-match file searching , 1977, CACM.

[7]  Audra E. Kosh,et al.  Linear Algebra and its Applications , 1992 .

[8]  Keinosuke Fukunaga,et al.  Introduction to statistical pattern recognition (2nd ed.) , 1990 .

[9]  John A. Hartigan,et al.  Clustering Algorithms , 1975 .

[10]  R. Ng,et al.  Efficient and Effective Clustering Methods for Spatial Data Mining , 1994 .

[11]  Tomasz Imielinski,et al.  Mining association rules between sets of items in large databases , 1993, SIGMOD Conference.

[12]  Antonin Guttman,et al.  R-trees: a dynamic index structure for spatial searching , 1984, SIGMOD '84.

[13]  Irene Gargantini,et al.  An effective way to represent quadtrees , 1982, CACM.

[14]  Ramakrishnan Srikant,et al.  Fast Algorithms for Mining Association Rules in Large Databases , 1994, VLDB.

[15]  Keinosuke Fukunaga,et al.  Introduction to Statistical Pattern Recognition , 1972 .

[16]  E. Myers,et al.  Basic local alignment search tool. , 1990, Journal of molecular biology.

[17]  Christos Faloutsos,et al.  Efficient Similarity Search In Sequence Databases , 1993, FODO.

[18]  Jack A. Orenstein A comparison of spatial query processing techniques for native and parameter spaces , 1990, SIGMOD '90.

[19]  Michael McGill,et al.  Introduction to Modern Information Retrieval , 1983 .

[20]  Jack A. Orenstein Spatial query processing in an object-oriented database system , 1986 .

[21]  Jürg Nievergelt,et al.  The Grid File: An Adaptable, Symmetric Multikey File Structure , 1984, TODS.

[22]  P. Venkat Rangan,et al.  Multimedia conferencing in the Etherphone environment , 1991, Computer.

[23]  Dennis Shasha,et al.  New techniques for best-match retrieval , 1990, TOIS.

[24]  C. Faloutsos Efficient Similarity Search in Sequence Databases , 1993 .

[25]  David B. Lomet,et al.  The hB-tree: a multiattribute indexing method with good guaranteed performance , 1990, TODS.

[26]  Susan T. Dumais,et al.  Personalized information delivery: an analysis of information filtering methods , 1992, CACM.

[27]  A. Ravishankar Rao,et al.  Identifying High Level Features of Texture Perception , 1993, CVGIP Graph. Model. Image Process.

[28]  Stavros Christodoulakis,et al.  Multimedia Information Systems: The Unfolding of a Reality (Guest Editors' Introduction) , 1991, Computer.

[29]  Walter A. Burkhard,et al.  Some approaches to best-match file searching , 1973, Commun. ACM.

[30]  Peter E. Hart,et al.  Pattern classification and scene analysis , 1974, A Wiley-Interscience publication.

[31]  Hans-Peter Kriegel,et al.  Efficient processing of spatial joins using R-trees , 1993, SIGMOD Conference.

[32]  David Sankoff,et al.  Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison , 1983 .

[33]  Richard O. Duda,et al.  Pattern classification and scene analysis , 1974, A Wiley-Interscience publication.

[34]  Forrest W. Young Multidimensional Scaling: History, Theory, and Applications , 1987 .

[35]  Myron Wish,et al.  Basic Concepts of Multidimensional Scaling , 1978 .

[36]  J. Kruskal Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis , 1964 .

[37]  Karen Kukich,et al.  Techniques for automatically correcting words in text , 1992, CSUR.

[38]  R. Shepard The analysis of proximities: Multidimensional scaling with an unknown distance function. II , 1962 .

[39]  H. V. Jagadish,et al.  A retrieval technique for similar shapes , 1991, SIGMOD '91.

[40]  Christos Faloutsos,et al.  The R+-Tree: A Dynamic Index for Multi-Dimensional Objects , 1987, VLDB.

[41]  David B. Lomet,et al.  The hB-tree: a multiattribute indexing method with good guaranteed performance , 1990 .

[42]  H. V. Jagadish,et al.  Linear clustering of objects with multiple attributes , 1990, SIGMOD '90.

[43]  Alexander Dekhtyar,et al.  Information Retrieval , 2018, Lecture Notes in Computer Science.

[44]  A. Ravishankar Rao,et al.  Identifying high-level features of texture perception , 1992, Electronic Imaging.

[45]  Jack A. Orenstein Spatial query processing in an object-oriented database system , 1986, SIGMOD '86.

[46]  Gene H. Golub,et al.  Matrix computations , 1983 .

[47]  Christos Faloutsos,et al.  Hilbert R-tree: An Improved R-tree using Fractals , 1994, VLDB.

[48]  Christos Faloutsos,et al.  QBIC project: querying images by content, using color, texture, and shape , 1993, Electronic Imaging.

[49]  Nick Roussopoulos,et al.  Nearest neighbor queries , 1995, SIGMOD '95.

[50]  Ricardo A. Baeza-Yates,et al.  Proximity Matching Using Fixed-Queries Trees , 1994, CPM.

[51]  Hans-Peter Kriegel,et al.  Multi-step processing of spatial joins , 1994, SIGMOD '94.

[52]  Christos Faloutsos,et al.  Fractals for secondary key retrieval , 1989, PODS.

[53]  Jiawei Han,et al.  Efficient and Effective Clustering Methods for Spatial Data Mining , 1994, VLDB.

[54]  Fionn Murtagh,et al.  A Survey of Recent Advances in Hierarchical Clustering Algorithms , 1983, Comput. J..