Efficient ensemble algorithm for mixed numeric and categorical data

Most existing clustering algorithms focus on numerical data, whose inherent geometric properties can be exploited naturally to define distance functions between data points. However, much of the data stored in databases is categorical, with attribute values that cannot be naturally ordered the way numerical values can. Because these two kinds of data differ in their characteristics, attempts to develop criterion functions for mixed data have not been very successful. In this research, we propose a novel divide-and-conquer technique to solve this problem. First, the original mixed dataset is divided into two sub-datasets: a pure categorical dataset and a pure numeric dataset. Next, existing well-established clustering algorithms designed for each type of dataset are employed to produce the corresponding clusters. Last, the clustering results on the categorical and numeric datasets are combined into a single categorical dataset, on which a categorical data clustering algorithm is employed to obtain the final output. Our main contribution in this research is an algorithmic framework for the mixed-attribute clustering problem, into which existing clustering algorithms can be easily integrated.
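
The three steps above can be illustrated with a minimal Python sketch. It is not the implementation evaluated in this work: the base clusterers here (a plain k-means for the numeric part and a simple k-modes-style procedure for the categorical part), the function names `kmeans`, `kmodes` and `cluster_mixed`, the deterministic initialisation, and the choice of the same k at every stage are all assumptions made for illustration. Any other numeric or categorical clustering algorithm could be substituted.

```python
from collections import Counter


def _first_distinct(rows, k):
    # Illustrative, deterministic initialisation: the first k distinct rows.
    return list(dict.fromkeys(rows))[:k]


def kmeans(points, k, iters=20):
    """Plain k-means on numeric tuples; returns one cluster label per point."""
    centers = _first_distinct(points, k)
    labels = []
    for _ in range(iters):
        # Assign each point to the nearest center (squared Euclidean distance).
        labels = [
            min(range(len(centers)),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Move each center to the mean of its members.
        for c in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels


def kmodes(rows, k, iters=20):
    """Simple k-modes-style clustering on categorical tuples (mismatch count)."""
    modes = _first_distinct(rows, k)
    labels = []
    for _ in range(iters):
        # Assign each row to the mode with the fewest mismatching attributes.
        labels = [
            min(range(len(modes)),
                key=lambda c: sum(a != b for a, b in zip(r, modes[c])))
            for r in rows
        ]
        # Update each mode to the most frequent value per attribute.
        for c in range(len(modes)):
            members = [r for r, lab in zip(rows, labels) if lab == c]
            if members:
                modes[c] = tuple(Counter(col).most_common(1)[0][0]
                                 for col in zip(*members))
    return labels


def cluster_mixed(records, numeric_idx, categorical_idx, k):
    """Divide-and-conquer clustering of mixed records (hypothetical helper)."""
    # Step 1: split the mixed dataset into numeric and categorical parts.
    numeric = [tuple(float(r[i]) for i in numeric_idx) for r in records]
    categorical = [tuple(r[i] for i in categorical_idx) for r in records]
    # Step 2: cluster each part with an algorithm suited to its type.
    num_labels = kmeans(numeric, k)
    cat_labels = kmodes(categorical, k)
    # Step 3: treat the two label columns as a new categorical dataset
    # and cluster it with the categorical algorithm.
    combined = list(zip(num_labels, cat_labels))
    return kmodes(combined, k)


records = [
    (1.0, 1.1, "red", "small"), (0.9, 1.0, "red", "small"),
    (1.2, 0.8, "red", "small"), (8.0, 8.1, "blue", "large"),
    (7.9, 8.3, "blue", "large"), (8.2, 7.9, "blue", "large"),
]
final_labels = cluster_mixed(records, numeric_idx=[0, 1],
                             categorical_idx=[2, 3], k=2)
```

In this toy example the first three records and the last three records end up in two different final clusters. The point of the design is that the final step only ever sees cluster labels, so the base algorithms can be swapped without changing the combination stage.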