Similarity search in sets and categorical data using the signature tree

Data mining applications analyze large collections of set data and high dimensional categorical data. Search on these data types is not restricted to the classic problems of mining association rules and classification, but similarity search is also a frequently applied operation. Access methods/or multidimensional numerical data are inappropriate for this problem and specialized indexes are needed. We propose a method that represents set data as bitmaps (signatures) and organizes them into a hierarchical index, suitable for similarity search and other related query types. In contrast to a previous technique, the signature tree is dynamic and does not rely on hardwired constants. Experiments with synthetic and real datasets show that it is robust to different data characteristics, scalable to the database size and efficient for various queries.

[1]  Oliver Günther,et al.  Multidimensional access methods , 1998, CSUR.

[2]  G. Moerkotte,et al.  A Study of Four Index Structures for Set-Valued Attributes of Low Cardinality , 1999 .

[3]  Jonathan Goldstein,et al.  When Is ''Nearest Neighbor'' Meaningful? , 1999, ICDT.

[4]  Nick Roussopoulos,et al.  Nearest neighbor queries , 1995, SIGMOD '95.

[5]  Christos Faloutsos,et al.  Fast Nearest Neighbor Search in Medical Image Databases , 1996, VLDB.

[6]  Masatoshi Yoshikawa,et al.  The A-tree: An Index Structure for High-Dimensional Spaces Using Relative Approximation , 2000, VLDB.

[7]  Hans-Peter Kriegel,et al.  Efficient processing of spatial joins using R-trees , 1993, SIGMOD Conference.

[8]  Philip S. Yu,et al.  A new method for similarity indexing of market basket data , 1999, SIGMOD '99.

[9]  Jon M. Kleinberg,et al.  Clustering categorical data: an approach based on dynamical systems , 2000, The VLDB Journal.

[10]  Dimitrios Gunopulos,et al.  Efficient and tumble similar set retrieval , 2001, SIGMOD '01.

[11]  Antonin Guttman,et al.  R-trees: a dynamic index structure for spatial searching , 1984, SIGMOD '84.

[12]  ManolopoulosYannis,et al.  Closest pair queries in spatial databases , 2000 .

[13]  Johannes Gehrke,et al.  CACTUS—clustering categorical data using summaries , 1999, KDD '99.

[14]  Yannis Manolopoulos,et al.  Closest pair queries in spatial databases , 2000, SIGMOD '00.

[15]  Ramakrishnan Srikant,et al.  Fast Algorithms for Mining Association Rules in Large Databases , 1994, VLDB.

[16]  Hanan Samet,et al.  Distance browsing in spatial databases , 1999, TODS.

[17]  Anil K. Jain,et al.  Algorithms for Clustering Data , 1988 .

[18]  Hans-Jörg Schek,et al.  A Quantitative Analysis and Performance Study for Similarity-Search Methods in High-Dimensional Spaces , 1998, VLDB.

[19]  Martin L. Kersten,et al.  Efficient k-NN search on vertically decomposed data , 2002, SIGMOD '02.

[20]  Christos Faloutsos,et al.  Hilbert R-tree: An Improved R-tree using Fractals , 1994, VLDB.

[21]  Sudipto Guha,et al.  ROCK: a robust clustering algorithm for categorical attributes , 1999, Proceedings 15th International Conference on Data Engineering (Cat. No.99CB36337).

[22]  Uwe Deppisch,et al.  S-tree: a dynamic balanced signature index for office retrieval , 1986, SIGIR '86.