Many methods of knowledge discovery and data mining, such as nearest neighbor classification and clustering, are distance-based, so similarity measures between objects play an essential role. While real-world databases are often heterogeneous, with mixed numeric and symbolic attributes, most available similarity measures can be applied only to either symbolic or numeric data. In such cases, data mining methods often require transforming numeric data into symbolic data by discretization. Mixed similarity measures (MSMs) that do not discretize numeric values are desirable alternatives for objects with mixed symbolic and numeric data. However, the time and space complexities of computing available MSMs are often so high that MSMs are not applicable to large datasets. Within the framework of Goodall's MSM, which was inspired by biological taxonomy, computing methods have been proposed, but their time and space complexities are so far at least O(n² log n²) and O(n²), respectively. In this work, we propose a new and efficient method for computing this MSM with O(n log n) time and O(n) space complexity. We demonstrate experimentally the applicability of the new method to large datasets and suggest meta-knowledge on the use of this MSM. In practice, the experimental results show that only an MSM with near-linear time and space complexity is applicable to mining large heterogeneous datasets.
[1] D. W. Goodall. A New Similarity Index Based on Probability. 1966.
[2] Tu Bao Ho, et al. Study of a Mixed Similarity Measure for Classification and Clustering. PAKDD, 1999.
[3] K. Chidananda Gowda, et al. Symbolic Clustering Using a New Similarity Measure. IEEE Trans. Syst. Man Cybern., 1992.
[4] Manabu Ichino, et al. Generalized Minkowski Metrics for Mixed Feature-Type Data Analysis. IEEE Trans. Syst. Man Cybern., 1994.
[5] Alberto Maria Segre, et al. Programs for Machine Learning. 1994.
[6] H. O. Lancaster. The Combination of Probabilities Arising from Data in Discrete Distributions. Biometrika, 1949.
[7] Zhexue Huang, et al. Clustering Large Data Sets with Mixed Numeric and Categorical Values. 1997.