Similarity of Attributes by External Probes

In data mining, similarity or distance between attributes is one of the central notions. Such a notion can be used to build attribute hierarchies etc. Similarity metrics can be user-defined, but an important problem is defining similarity on the basis of data. Several methods based on statistical techniques exist. For defining the similarity between two attributes A and B they typically consider only the values of A and B, not the other attributes. We describe how a similarity notion between attributes can be defined by considering the values of other attributes. The basic idea is that in a 0/1 relation r, two attributes A and B are similar if the subrelations σA=1(r) and σB=1(r) are similar. Similarity between the two relations is defined by considering the marginal frequencies of a selected subset of other attributes. We show that the framework produces natural notions of similarity. Empirical results on the Reuters-21578 document dataset show, for example, how natural classifications for countries can be discovered from keyword distributions in documents. The similarity notion is easily computable with scalable algorithms.

[1]  Arno J. Knobbe,et al.  Analysing Binary Associations , 1996, KDD.

[2]  Heikki Mannila,et al.  Fast Discovery of Association Rules , 1996, Advances in Knowledge Discovery and Data Mining.

[3]  Jiawei Han,et al.  Knowledge Discovery in Databases: An Attribute-Oriented Approach , 1992, VLDB.

[4]  R. A. Leibler,et al.  On Information and Sufficiency , 1951 .

[5]  Dina Q. Goldin,et al.  On Similarity Queries for Time-Series Data: Constraint Specification and Implementation , 1995, CP.

[6]  Kyuseok Shim,et al.  Fast Similarity Search in the Presence of Noise, Scaling, and Translation in Time-Series Databases , 1995, VLDB.

[7]  Alberto O. Mendelzon,et al.  Similarity-based queries , 1995, PODS '95.

[8]  Tomasz Imielinski,et al.  Mining association rules between sets of items in large databases , 1993, SIGMOD Conference.

[9]  Ramakrishnan Srikant,et al.  Mining generalized association rules , 1995, Future Gener. Comput. Syst..

[10]  Christos Faloutsos,et al.  Efficient Similarity Search In Sequence Databases , 1993, FODO.

[11]  C. Faloutsos Eecient Similarity Search in Sequence Databases , 1993 .

[12]  Rajeev Motwani,et al.  Beyond market baskets: generalizing association rules to correlations , 1997, SIGMOD '97.

[13]  Raj Jain,et al.  Algorithms and strategies for similarity retrieval , 1996 .

[14]  Alberto O. Mendelzon,et al.  Similarity-based queries for time series data , 1997, SIGMOD '97.

[15]  Ali S. Hadi,et al.  Finding Groups in Data: An Introduction to Chster Analysis , 1991 .

[16]  S W Guo,et al.  Linkage disequilibrium measures for fine-scale mapping: a comparison. , 1997, Human heredity.

[17]  Anil K. Jain,et al.  Algorithms for Clustering Data , 1988 .

[18]  M. Basseville Distance measures for signal processing and pattern recognition , 1989 .

[19]  Heikki Mannila,et al.  Knowledge discovery from telecommunication network alarm databases , 1996, Proceedings of the Twelfth International Conference on Data Engineering.

[20]  L. A. Goodman,et al.  Measures of association for cross classifications , 1979 .

[21]  Wendy R. Fox,et al.  Finding Groups in Data: An Introduction to Cluster Analysis , 1991 .