A new distance metric for unsupervised learning of categorical data

A distance metric is the basis of many learning algorithms, and its effectiveness usually has a significant influence on the learning results. Measuring distance for numerical data is generally tractable, but for categorical data sets it can be a nontrivial problem. This paper therefore presents a new distance metric for categorical data based on the characteristics of categorical values. Specifically, the distance between two values of one attribute is determined by both the frequency probabilities of these two values and the values of other attributes that are highly interdependent with the attribute under consideration. Promising experimental results on several real data sets demonstrate the effectiveness of the proposed distance metric.
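To make the idea concrete, the following is a minimal sketch of a distance of this general kind. It is not the formula proposed in the paper, which the abstract does not state. Every name and modelling choice here is an assumption made for illustration: the function names, the use of mutual information divided by joint entropy as the interdependency measure, the `threshold` parameter, the frequency term `p(a) + p(b)`, and the use of total variation distance between conditional distributions.

```python
from collections import Counter
from math import log


def interdependency(rows, i, j):
    """Assumed measure: mutual information of attributes i and j divided by their joint entropy."""
    n = len(rows)
    p_i = Counter(r[i] for r in rows)
    p_j = Counter(r[j] for r in rows)
    p_ij = Counter((r[i], r[j]) for r in rows)
    mi = sum(
        c / n * log((c / n) / ((p_i[a] / n) * (p_j[b] / n)))
        for (a, b), c in p_ij.items()
    )
    joint_entropy = -sum(c / n * log(c / n) for c in p_ij.values())
    return mi / joint_entropy if joint_entropy > 0 else 0.0


def conditional_distribution(rows, i, value, j):
    """Empirical distribution of attribute j over the rows where attribute i equals value."""
    counts = Counter(r[j] for r in rows if r[i] == value)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {v: c / total for v, c in counts.items()}


def value_distance(rows, i, a, b, threshold=0.1):
    """Illustrative distance between values a and b of attribute i (hypothetical formula)."""
    if a == b:
        return 0.0
    n = len(rows)
    freq = Counter(r[i] for r in rows)
    # Frequency term (assumed form): sum of the relative frequencies of the two values.
    terms = [(freq[a] + freq[b]) / n]
    # Context terms: only attributes whose interdependency with attribute i is high enough.
    for j in range(len(rows[0])):
        if j == i or interdependency(rows, i, j) < threshold:
            continue
        pa = conditional_distribution(rows, i, a, j)
        pb = conditional_distribution(rows, i, b, j)
        # Total variation distance between the two conditional distributions.
        terms.append(0.5 * sum(abs(pa.get(v, 0.0) - pb.get(v, 0.0)) for v in set(pa) | set(pb)))
    return sum(terms) / len(terms)


# Toy data: attribute 0 is a colour, attribute 1 is a shape.
rows = [
    ("red", "round"),
    ("red", "round"),
    ("crimson", "round"),
    ("blue", "square"),
    ("blue", "square"),
    ("blue", "square"),
]
d_red_crimson = value_distance(rows, 0, "red", "crimson")
d_red_blue = value_distance(rows, 0, "red", "blue")
```

In this toy data, "red" and "crimson" always co-occur with the same shape, so the sketch places them closer together than "red" and "blue", which co-occur with different shapes.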
