On metricity of two heterogeneous measures in the presence of missing values

The heterogeneous Euclidean-overlap metric and the heterogeneous value difference metric, both introduced in the machine learning literature, are useful for handling mixed-type data in machine learning, pattern recognition and data mining tasks. Mixed-type variables are quite common in practical problems, but this property has seldom been taken into account in pattern recognition, data mining and decision-making algorithms. We observed that these two distance measures are not actually metrics: we found a special situation in which they are pseudometrics rather than metrics, a feature to be noted when using them. Nevertheless, by changing their definitions slightly, it is possible to satisfy the metric axioms. The redefinition of the two measures may be especially important in medical applications, since otherwise it is possible in theory that, for example, two identical cases would be classified differently. Nearest neighbor search tests with medical data were run to illustrate the behavior of these measures. Despite violating metricity, the original forms yielded slightly better classification results. The reason was that the real data sets tested contained very few nearly identical cases according to these distance measures, and the original forms, which are based on more strongly separating distances than the redefinitions, performed slightly better in classification.
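As an illustration of how missing values can interfere with the metric axioms, the following minimal sketch assumes the commonly cited form of the heterogeneous Euclidean-overlap measure, in which an attribute contributes the maximal distance 1 whenever either value is missing. This convention, the function name `heom`, the `ranges` argument and the use of `None` for a missing value are assumptions made for the example, not details taken from the abstract; the example does not show the redefinition proposed here and may not be the exact situation the authors analyzed.

```python
import math

def heom(x, y, ranges):
    """HEOM-style distance between two cases (illustrative sketch).

    ranges[i] is the numeric range of attribute i, or None if the
    attribute is nominal. A missing value is represented by None.
    """
    total = 0.0
    for a, b, r in zip(x, y, ranges):
        if a is None or b is None:
            d = 1.0                        # assumed convention: missing -> maximal distance
        elif r is None:
            d = 0.0 if a == b else 1.0     # overlap distance for nominal attributes
        else:
            d = abs(a - b) / r             # range-normalized difference for numeric attributes
        total += d * d
    return math.sqrt(total)

# Hypothetical attributes: one numeric (range 50) and one nominal.
ranges = [50.0, None]
complete = (37.0, "yes")
incomplete = (37.0, None)                  # same case with the nominal value missing

d_complete = heom(complete, complete, ranges)
d_incomplete = heom(incomplete, incomplete, ranges)
print(d_complete, d_incomplete)
```

Under these assumptions, a complete case is at distance zero from itself, whereas a case with a missing value is at a positive distance from an identical copy of itself, so two identical cases need not be treated as nearest neighbors.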
