for categorical data

In this paper, we propose a novel method to measure the dissimilarity of categorical data. The key idea is to consider the dissimilarity between two categorical values of an attribute as a combination of dissimilarities between the conditional probability distributions of other attributes given these two values. Experiments with real data show that our dissimilarity estimation method improves the accuracy of the popular nearest neighbor classifier.
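The key idea can be illustrated with a minimal sketch. The abstract does not specify which divergence compares the conditional distributions or how the per-attribute terms are combined, so the choices below (Jensen-Shannon divergence, combined by an unweighted mean) are assumptions made only for illustration, and the function names and toy data are hypothetical rather than taken from the paper.

```python
from collections import Counter
from math import log2


def conditional_distribution(rows, target_attr, given_attr, given_value):
    """Estimate P(target_attr | given_attr = given_value) by relative frequency."""
    counts = Counter(r[target_attr] for r in rows if r[given_attr] == given_value)
    total = sum(counts.values())
    if total == 0:
        raise ValueError(f"value {given_value!r} never occurs for {given_attr!r}")
    return {v: c / total for v, c in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits (assumed choice), bounded in [0, 1]."""
    keys = set(p) | set(q)
    # Mixture distribution m = (p + q) / 2 over the union of supports.
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}

    def kl(a):
        # KL(a || m); m[k] > 0 wherever a[k] > 0, so the ratio is well defined.
        return sum(pa * log2(pa / m[k]) for k, pa in a.items() if pa > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)


def value_dissimilarity(rows, attr, x, y):
    """Dissimilarity between values x and y of attr, combined over the other
    attributes by an unweighted mean (assumed combination rule)."""
    others = [a for a in rows[0] if a != attr]
    if not others:
        return 0.0 if x == y else 1.0
    total = sum(
        js_divergence(
            conditional_distribution(rows, a, attr, x),
            conditional_distribution(rows, a, attr, y),
        )
        for a in others
    )
    return total / len(others)


# Hypothetical toy data: "red" and "pink" co-occur with the same shape and size.
rows = [
    {"color": "red", "shape": "round", "size": "small"},
    {"color": "red", "shape": "round", "size": "small"},
    {"color": "pink", "shape": "round", "size": "small"},
    {"color": "blue", "shape": "square", "size": "large"},
]
print(value_dissimilarity(rows, "color", "red", "pink"))
print(value_dissimilarity(rows, "color", "red", "blue"))
```

In this toy data, "red" and "pink" induce identical conditional distributions over the other attributes and so receive dissimilarity 0, while "red" and "blue" induce disjoint ones and receive the maximum of 1. A nearest neighbor classifier could then sum such per-attribute value dissimilarities to compare two records.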
