Context-Based Similarity Measures for Categorical Databases

Similarity between complex data objects is one of the central notions in data mining. We propose certain similarity (or distance) measures between various components of a 0/1 relation. We define measures between attributes, between rows, and between subrelations of the database. They find important applications in clustering, classification, and several other data mining processes. Our measures are based on the contexts of individual components. For example, two products (i.e., attributes) are deemed similar if their respective sets of customers (i.e., subrelations) are similar. This reveals more subtle relationships between components, something that is usually missing in simpler measures. Our problem of finding distance measures can be formulated as a system of nonlinear equations. We present an iterative algorithm which, when seeded with random initial values, converges quickly to stable distances in practice (typically requiring less than five iterations). The algorithm requires only one database scan. Results on artificial and real data show that our method is efficient, and produces results with intuitive appeal.

[1]  Shimon Edelman,et al.  Similarity-based Word Sense Disambiguation , 1998, CL.

[2]  Sudipto Guha,et al.  ROCK: A Robust Clustering Algorithm for Categorical Attributes , 2000, Inf. Syst..

[3]  Arno J. Knobbe,et al.  Analysing Binary Associations , 1996, KDD.

[4]  Jiawei Han,et al.  Knowledge Discovery in Databases: An Attribute-Oriented Approach , 1992, VLDB.

[5]  Jon Kleinberg,et al.  Authoritative sources in a hyperlinked environment , 1999, SODA '98.

[6]  Johannes Gehrke,et al.  CACTUS—clustering categorical data using summaries , 1999, KDD '99.

[7]  Alberto O. Mendelzon,et al.  Similarity-based queries for time series data , 1997, SIGMOD '97.

[8]  Tian Zhang,et al.  BIRCH: an efficient data clustering method for very large databases , 1996, SIGMOD '96.

[9]  Dina Q. Goldin,et al.  On Similarity Queries for Time-Series Data: Constraint Specification and Implementation , 1995, CP.

[10]  Kyuseok Shim,et al.  Fast Similarity Search in the Presence of Noise, Scaling, and Translation in Time-Series Databases , 1995, VLDB.

[11]  Heikki Mannila,et al.  Distance measures for point sets and their computation , 1997, Acta Informatica.

[12]  T. Landauer,et al.  Indexing by Latent Semantic Analysis , 1990 .

[13]  Jon M. Kleinberg,et al.  Inferring Web communities from link topology , 1998, HYPERTEXT '98.

[14]  Jiawei Han,et al.  Efficient and Effective Clustering Methods for Spatial Data Mining , 1994, VLDB.

[15]  Christos Faloutsos,et al.  Efficient Similarity Search In Sequence Databases , 1993, FODO.

[16]  Vipin Kumar,et al.  Clustering Based On Association Rule Hypergraphs , 1997, DMKD.

[17]  Alberto O. Mendelzon,et al.  Similarity-based queries , 1995, PODS '95.

[18]  Jon M. Kleinberg,et al.  Clustering categorical data: an approach based on dynamical systems , 2000, The VLDB Journal.

[19]  Peter C. Cheeseman,et al.  Bayesian Classification (AutoClass): Theory and Results , 1996, Advances in Knowledge Discovery and Data Mining.

[20]  Ramakrishnan Srikant,et al.  Mining generalized association rules , 1995, Future Gener. Comput. Syst..

[21]  Heikki Mannila,et al.  Similarity of Attributes by External Probes , 1998, KDD.

[22]  Raj Jain,et al.  Algorithms and strategies for similarity retrieval , 1996 .