Paper information - Distance functions for categorical and mixed variables

Distance functions for categorical and mixed variables

In this paper, we compare three measures for computing Mahalanobis-type distances between random variables consisting of several categorical dimensions, or of mixed categorical and numeric dimensions: regular simplex, tensor product space, and symbolic covariance. The tensor product space and symbolic covariance distances are new contributions. We test the methods on two application domains: classification and principal components analysis. We find that the tensor product space distance is impractical for most problems. Overall, the regular simplex method is the most successful in both domains, but the symbolic covariance method has several advantages, including time and space efficiency, applicability to different contexts, and theoretical neatness.

Brendan McCane | Michael Albert
