
Similarity of Attributes by External Probes

In data mining, similarity or distance between attributes is a central notion; it can be used, for example, to build attribute hierarchies. Similarity metrics can be user-defined, but an important problem is defining similarity on the basis of the data. Several methods based on statistical techniques exist. To define the similarity between two attributes A and B, these methods typically consider only the values of A and B, not the other attributes. We describe how a similarity notion between attributes can be defined by considering the values of other attributes. The basic idea is that in a 0/1 relation r, two attributes A and B are similar if the subrelations σ_{A=1}(r) and σ_{B=1}(r) are similar. Similarity between the two subrelations is defined by considering the marginal frequencies of a selected subset of other attributes. We show that the framework produces natural notions of similarity. Empirical results on the Reuters-21578 document dataset show, for example, how natural classifications for countries can be discovered from keyword distributions in documents. The similarity notion is easily computable with scalable algorithms.
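The idea in the abstract can be illustrated with a minimal sketch. It assumes that the distance between the two subrelations is the sum of absolute differences of the marginal frequencies of the probe attributes; the abstract does not state the exact formula, so this choice is only an assumption. The function names, the toy relation, and the probe set are all hypothetical and are not taken from the paper or from Reuters-21578.

```python
from typing import Dict, List, Sequence

Row = Dict[str, int]  # one tuple of a 0/1 relation: attribute name -> 0 or 1

ATTRS = ["Finland", "Sweden", "Brazil", "timber", "coffee"]


def doc(**present: int) -> Row:
    """Build a row in which every attribute not mentioned is 0."""
    return {name: present.get(name, 0) for name in ATTRS}


def select_rows(rows: Sequence[Row], attr: str) -> List[Row]:
    """Subrelation of rows in which attr = 1."""
    return [row for row in rows if row[attr] == 1]


def marginal_frequency(rows: Sequence[Row], attr: str) -> float:
    """Fraction of rows with attr = 1 (0.0 for an empty relation)."""
    if not rows:
        return 0.0
    return sum(row[attr] for row in rows) / len(rows)


def external_distance(rows: Sequence[Row], a: str, b: str,
                      probes: Sequence[str]) -> float:
    """Assumed distance: sum over the probes of the absolute difference
    between their marginal frequencies in the two subrelations."""
    sub_a = select_rows(rows, a)
    sub_b = select_rows(rows, b)
    return sum(
        abs(marginal_frequency(sub_a, d) - marginal_frequency(sub_b, d))
        for d in probes
    )


# Made-up toy relation: each row is a document, each attribute a keyword.
rows = [
    doc(Finland=1, timber=1),
    doc(Finland=1, timber=1),
    doc(Sweden=1, timber=1),
    doc(Sweden=1),
    doc(Brazil=1, coffee=1),
    doc(Brazil=1, coffee=1),
]
probes = ["timber", "coffee"]

print(external_distance(rows, "Finland", "Sweden", probes))
print(external_distance(rows, "Finland", "Brazil", probes))
```

In this toy relation Finland and Sweden never occur in the same document, so a measure that looks only at their own two columns carries no sign that they are related. The probe attributes still place them closer to each other than either is to Brazil, because their documents share the keyword timber.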

Heikki Mannila | Gautam Das | Pirjo Ronkainen
