Clustering Heterogeneous Data with k-Means by Mutual Information-Based Unsupervised Feature Transformation

Traditional centroid-based clustering algorithms applied to heterogeneous data with both numerical and non-numerical features suffer varying degrees of clustering inaccuracy. This is because the Hamming distance used to measure dissimilarity between non-numerical values does not yield optimal distances between distinct values, and further problems arise from attempts to combine the Euclidean and Hamming distances. In this study, mutual information (MI)-based unsupervised feature transformation (UFT), which can transform non-numerical features into numerical features without information loss, was combined with the conventional k-means algorithm for heterogeneous data clustering. For the original non-numerical features, UFT provides numerical values that preserve the structure of the original non-numerical features while having the properties of continuous values. Experiments and analysis on real-world datasets showed that the integrated UFT-k-means clustering algorithm outperformed other methods on heterogeneous data with both numerical and non-numerical features.
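The overall pipeline, transforming non-numerical features into numerical ones and then running standard k-means, can be sketched as below. This is an illustrative approximation only: the fixed category-to-number mapping is a hypothetical placeholder, not the paper's MI-based UFT, and the `mutual_information` helper merely shows the empirical quantity that UFT builds on. All data values and function names are invented for illustration.

```python
import math
import random
from collections import Counter

def mutual_information(xs, ys):
    """Empirical mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    px, py = Counter(xs), Counter(ys)
    pxy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in pxy.items():
        p_joint = c / n
        mi += p_joint * math.log(p_joint / ((px[x] / n) * (py[y] / n)))
    return mi

def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's k-means on tuples of floats; returns (labels, centers)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)

    def dist2(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        labels = [min(range(k), key=lambda j: dist2(p, centers[j])) for p in points]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centers.append(centers[j])  # keep an empty cluster's center
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers

# Sanity check for the MI helper: two perfectly correlated binary features
# have MI equal to the entropy of either one, i.e. ln 2.
mi = mutual_information(['a', 'a', 'b', 'b'], [0, 0, 1, 1])

# Toy heterogeneous data: one numerical feature plus one categorical feature.
# The categorical values are mapped to numbers with a hand-picked placeholder
# encoding; the paper's UFT would instead derive this mapping from MI.
numeric = [0.10, 0.20, 0.15, 5.00, 5.10, 4.90]
categorical = ['x', 'x', 'x', 'y', 'y', 'y']
encoding = {'x': 0.0, 'y': 1.0}  # hypothetical stand-in for UFT
points = [(num, encoding[cat]) for num, cat in zip(numeric, categorical)]

labels, centers = kmeans(points, k=2)
```

In this toy example the first three rows and the last three rows form two well-separated groups in the transformed space, so k-means recovers that partition regardless of which two points seed the centers.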
