Paper information - A similarity algorithm for categorical variables

A similarity algorithm for categorical variables

How to measure the similarity of data objects is one of the most important problems in data analysis. This paper proposes a method that uses only the distribution of attributes to measure the similarity between nominal data objects. The algorithm focuses on the logarithm of the conditional probability, because we think that distribution information is the only information a dataset can provide without domain knowledge. First, we calculate the conditional probabilities between the target data objects and every other attribute. Then we convert them to logarithmic form and sort them by data object. In the last step, we use the average value of each attribute column to compose the feature vector of each data object, and the Euclidean distance between feature vectors serves as the similarity metric between data objects. Experiments on extensive UCI data sets show that the derived similarity metric achieves considerable accuracy.
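The abstract gives only an outline of the procedure, so the following Python sketch is one possible reading of it, not the authors' implementation. It assumes that the "data objects" are the values of one target attribute, that the conditional probability is P(target value | value of another attribute) estimated from co-occurrence counts, and that the per-attribute averages are taken over the rows carrying each target value. The names `value_feature_vectors` and `euclidean`, and the toy data, are hypothetical.

```python
import math
from collections import Counter


def value_feature_vectors(rows, target):
    """Build one feature vector per value of the target column.

    rows:   list of equal-length tuples of categorical values
    target: index of the column whose values are to be compared
    Assumed reading of the abstract, not the paper's exact procedure.
    """
    others = [j for j in range(len(rows[0])) if j != target]

    # Count co-occurrences (other value, target value) and marginals of each other value.
    joint = {j: Counter() for j in others}
    marginal = {j: Counter() for j in others}
    for r in rows:
        for j in others:
            joint[j][(r[j], r[target])] += 1
            marginal[j][r[j]] += 1

    # Group rows by target value and sum log P(target value | other value) per column.
    sums = {}
    counts = Counter()
    for r in rows:
        v = r[target]
        acc = sums.setdefault(v, [0.0] * len(others))
        for k, j in enumerate(others):
            # Always > 0, because the current row itself is counted in joint[j].
            p = joint[j][(r[j], v)] / marginal[j][r[j]]
            acc[k] += math.log(p)
        counts[v] += 1

    # Average each column to obtain the feature vector of each target value.
    return {v: [s / counts[v] for s in acc] for v, acc in sums.items()}


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Toy data: column 0 is the target attribute, column 1 is another attribute.
rows = [
    ("red", "round"), ("red", "round"),
    ("green", "round"), ("green", "long"),
    ("yellow", "long"), ("yellow", "long"),
]
fv = value_feature_vectors(rows, target=0)
print(euclidean(fv["red"], fv["green"]))
```

A smaller distance means that two values of the target attribute are treated as more similar. Under this reading each feature vector has one component per non-target attribute, so the sketch needs no domain knowledge beyond the value counts in the data.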

Liang Zhao | Jian-Hui Liu

[1] Jiawei Han, et al. Document clustering using locality preserving indexing, 2005, IEEE Transactions on Knowledge and Data Engineering.

[2] Jianhong Wu, et al. Data clustering - theory, algorithms, and applications, 2007.

[3] Ulrike von Luxburg, et al. A tutorial on spectral clustering, 2007, Stat. Comput.

[4] David L. Waltz, et al. Toward memory-based reasoning, 1986, CACM.

[5] Vipin Kumar, et al. Similarity Measures for Categorical Data: A Comparative Evaluation, 2008, SDM.

[6] Longbing Cao, et al. Coupled nominal similarity in unsupervised learning, 2011, CIKM '11.

[7] Lipika Dey, et al. A k-mean clustering algorithm for mixed numeric and categorical data, 2007, Data Knowl. Eng.

[8] Philip S. Yu, et al. Coupled Behavior Analysis with Applications, 2012, IEEE Transactions on Knowledge and Data Engineering.

[9] Heikki Mannila, et al. Context-Based Similarity Measures for Categorical Databases, 2000, PKDD.

[10] Tony R. Martinez, et al. Improved Heterogeneous Distance Functions, 1996, J. Artif. Intell. Res.

[11] S. Salzberg, et al. A weighted nearest neighbor algorithm for learning with symbolic features, 2004, Machine Learning.