Distance functions for categorical and mixed variables

In this paper, we compare three measures for computing Mahalanobis-type distances between random variables that consist of several categorical dimensions, or of mixed categorical and numeric dimensions: regular simplex, tensor product space, and symbolic covariance. The tensor product space and symbolic covariance distances are new contributions. We test the methods in two application domains: classification and principal components analysis. We find that the tensor product space distance is impractical for most problems. Overall, the regular simplex method is the most successful in both domains, but the symbolic covariance method has several advantages, including time and space efficiency, applicability to different contexts, and theoretical neatness.
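To make the regular simplex idea concrete, the following is a minimal sketch, not the paper's implementation. It assumes that each level of a categorical variable is mapped to a vertex of a regular simplex, built here from scaled one-hot vectors so that any two distinct levels lie at Euclidean distance 1. The names `simplex_encode`, `encode_record`, and `schema`, and the example data, are hypothetical. Plain Euclidean distance stands in for the Mahalanobis-type distance, which would additionally weight the encoded coordinates by an estimated inverse covariance.

```python
import math


def simplex_encode(level, levels):
    """Map one categorical level to a vertex of a regular simplex.

    Assumed construction: one-hot vectors scaled by 1/sqrt(2), so that
    any two distinct levels are at Euclidean distance 1 from each other.
    """
    scale = 1.0 / math.sqrt(2.0)
    return [scale if level == other else 0.0 for other in levels]


def encode_record(record, schema):
    """Turn a mixed record into a single numeric vector.

    schema holds one entry per field: None for a numeric field, or the
    list of allowed levels for a categorical field.
    """
    vector = []
    for value, levels in zip(record, schema):
        if levels is None:
            vector.append(float(value))
        else:
            vector.extend(simplex_encode(value, levels))
    return vector


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Hypothetical mixed data: one numeric field and one categorical field.
schema = [None, ["red", "green", "blue"]]
a = encode_record((1.0, "red"), schema)
b = encode_record((1.0, "blue"), schema)
c = encode_record((4.0, "red"), schema)

print(euclidean(a, b))  # differs only in the categorical field
print(euclidean(a, c))  # differs only in the numeric field
```

Because every pair of distinct levels is equidistant, the encoding imposes no artificial ordering on the categories, which is the property a simplex-based representation is meant to provide.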
