Clustering Mixed Data Based on Density Peaks and Stacked Denoising Autoencoders

Mixed data with both numerical and categorical attributes are ubiquitous in the real world, and a variety of clustering algorithms have been developed to discover the information hidden in such data. Most existing clustering algorithms compute the distances or similarities between data objects directly on the original data, which can make the clustering results unstable in the presence of noise. In this paper, a clustering framework is proposed to explore the grouping structure of mixed data. First, the categorical attributes, transformed by one-hot encoding, and the normalized numerical attributes are input to a stacked denoising autoencoder to learn internal feature representations. Second, based on these feature representations, the pairwise distances between data objects in the feature space are calculated, and from them the local density and relative distance of each data object are computed. Third, the density peaks clustering algorithm is improved and employed to assign the data objects to clusters. Finally, experiments on several UCI datasets demonstrate that the proposed algorithm for clustering mixed data outperforms three baseline algorithms in terms of clustering accuracy and the Rand index.
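As a rough illustration of the quantities named in the abstract, the following Python sketch one-hot encodes and normalizes a small mixed dataset and then computes the local density and relative distance used by standard density peaks clustering. It is a minimal sketch under stated assumptions, not the method proposed in the paper: the stacked denoising autoencoder step is omitted (the preprocessed vectors stand in for the learned features), the Gaussian-kernel density and the cutoff `d_c` are assumptions, and the function names `preprocess` and `density_peaks_quantities` are hypothetical.

```python
import numpy as np

def preprocess(numeric, categorical):
    """Min-max normalize numeric columns and one-hot encode categorical ones."""
    numeric = np.asarray(numeric, dtype=float)
    low = numeric.min(axis=0)
    span = numeric.max(axis=0) - low
    span[span == 0] = 1.0  # avoid division by zero for constant columns
    blocks = [(numeric - low) / span]
    for col in np.asarray(categorical, dtype=object).T:
        values = sorted(set(col))
        blocks.append(np.array([[float(v == u) for u in values] for v in col]))
    return np.hstack(blocks)

def density_peaks_quantities(features, d_c):
    """Return (rho, delta): local density and relative distance per object.

    Assumption: Gaussian-kernel density with cutoff d_c, as in standard
    density peaks clustering; the paper's improved variant may differ.
    """
    diff = features[:, None, :] - features[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))             # pairwise Euclidean distances
    rho = np.exp(-(dist / d_c) ** 2).sum(axis=1) - 1.0   # subtract the self term
    delta = np.empty(len(features))
    for i in range(len(features)):
        higher = rho > rho[i]
        # Distance to the nearest denser object; the densest object
        # gets its largest distance to any other object instead.
        delta[i] = dist[i, higher].min() if higher.any() else dist[i].max()
    return rho, delta

# Toy mixed data: one numeric and one categorical attribute, two groups.
numeric = [[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]]
categorical = [["a"], ["a"], ["a"], ["b"], ["b"], ["b"]]

X = preprocess(numeric, categorical)
rho, delta = density_peaks_quantities(X, d_c=0.5)
# Objects with both high rho and high delta are candidate cluster centers.
centers = np.argsort(rho * delta)[-2:]
```

In a full pipeline, `X` would be replaced by the hidden-layer output of the trained autoencoder before the distances are computed, and the remaining objects would be assigned to the cluster of their nearest denser neighbor.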

[1]  Daniel Cremers,et al.  Clustering with Deep Learning: Taxonomy and New Methods , 2018, ArXiv.

[2]  Hans-Peter Kriegel,et al.  A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise , 1996, KDD.

[3]  J. MacQueen Some methods for classification and analysis of multivariate observations , 1967 .

[4]  Hong Jia,et al.  Subspace Clustering of Categorical and Numerical Data With an Unknown Number of Clusters , 2018, IEEE Transactions on Neural Networks and Learning Systems.

[5]  Xiao Han,et al.  A fuzzy k-prototype clustering algorithm for mixed numeric and categorical data , 2012, Knowl. Based Syst..

[6]  Zhexue Huang,et al.  CLUSTERING LARGE DATA SETS WITH MIXED NUMERIC AND CATEGORICAL VALUES , 1997 .

[7]  Trevor Darrell,et al.  Fully Convolutional Networks for Semantic Segmentation , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[8]  Jiye Liang,et al.  Space Structure and Clustering of Categorical Data , 2016, IEEE Transactions on Neural Networks and Learning Systems.

[9]  Renato Bruni,et al.  A min-cut approach to functional regionalization, with a case study of the Italian local labour market areas , 2016, Optim. Lett..

[10]  Sharmila Subudhi,et al.  A hybrid mobile call fraud detection model using optimized fuzzy C-means clustering and group method of data handling-based network , 2018, Vietnam Journal of Computer Science.

[11]  Yoram Singer,et al.  Adaptive Subgradient Methods for Online Learning and Stochastic Optimization , 2011, J. Mach. Learn. Res..

[12]  Jun Wang,et al.  Deep Learning over Multi-field Categorical Data - - A Case Study on User Response Prediction , 2016, ECIR.

[13]  Maoguo Gong,et al.  Unsupervised evolutionary clustering algorithm for mixed type data , 2010, IEEE Congress on Evolutionary Computation.

[14]  Qiang Wang,et al.  Fuzzy soft subspace clustering method for gene co-expression network analysis , 2010, 2010 IEEE International Conference on Bioinformatics and Biomedicine Workshops (BIBMW).

[15]  Qiang Liu,et al.  A Survey of Clustering With Deep Learning: From the Perspective of Network Architecture , 2018, IEEE Access.

[16]  Huachun Tan,et al.  Variational Deep Embedding: An Unsupervised and Generative Approach to Clustering , 2016, IJCAI.

[17]  Ivan Marsic,et al.  From Categorical to Numerical: Multiple Transitive Distance Learning and Embedding , 2015, SDM.

[18]  Joshua Zhexue Huang,et al.  A Fast Clustering Algorithm to Cluster Very Large Categorical Data Sets in Data Mining , 1997, DMKD.

[19]  Gang Chen,et al.  Deep Learning with Nonparametric Clustering , 2015, ArXiv.

[20]  Yunchuan Sun,et al.  Adaptive fuzzy clustering by fast search and find of density peaks , 2015, 2015 International Conference on Identification, Information, and Knowledge in the Internet of Things (IIKI).

[21]  Philip Chan,et al.  Determining the number of clusters/segments in hierarchical clustering/segmentation algorithms , 2004, 16th IEEE International Conference on Tools with Artificial Intelligence.

[22]  Richi Nayak,et al.  Fine-grained document clustering via ranking and its application to social media analytics , 2018, Social Network Analysis and Mining.

[23]  Xiao Xu,et al.  An entropy-based density peaks clustering algorithm for mixed type data employing fuzzy neighborhood , 2017, Knowl. Based Syst..

[24]  Fanyu Bu A High-Order Clustering Algorithm Based on Dropout Deep Learning for Heterogeneous Data in Cyber-Physical-Social Systems , 2018, IEEE Access.

[25]  Carlos F.M. Coimbra,et al.  On the determination of coherent solar microclimates for utility planning and operations , 2014 .

[26]  Chun-Yan Han,et al.  Improved SLIC imagine segmentation algorithm based on K-means , 2017, Cluster Computing.

[27]  Bo Zhang,et al.  Discriminatively Boosted Image Clustering with Fully Convolutional Auto-Encoders , 2017, Pattern Recognit..

[28]  Chia-Wen Lin,et al.  CNN-Based Joint Clustering and Representation Learning with Feature Drift Compensation for Large-Scale Image Data , 2017, IEEE Transactions on Multimedia.

[29]  H. Ralambondrainy,et al.  A conceptual version of the K-means algorithm , 1995, Pattern Recognit. Lett..

[30]  Liangzhong Shen,et al.  Clustering Mixed Data by Fast Search and Find of Density Peaks , 2017 .

[31]  Harold W. Kuhn,et al.  The Hungarian method for the assignment problem , 1955, 50 Years of Integer Programming.

[32]  Marc'Aurelio Ranzato,et al.  Efficient Learning of Sparse Representations with an Energy-Based Model , 2006, NIPS.

[33]  Hong Jia,et al.  Categorical-and-numerical-attribute data clustering based on a unified similarity metric without knowing cluster number , 2013, Pattern Recognit..

[34]  Geoffrey E. Hinton,et al.  Learning representations by back-propagating errors , 1986, Nature.

[35]  Anton V. Ushakov,et al.  Bi-level and Bi-objective p-Median Type Problems for Integrative Clustering: Application to Analysis of Cancer Gene-Expression and Drug-Response Data , 2018, IEEE/ACM Transactions on Computational Biology and Bioinformatics.

[36]  Sotirios Chatzis,et al.  A fuzzy c-means-type algorithm for clustering of data with mixed numeric and categorical attributes employing a probabilistic dissimilarity functional , 2011, Expert Syst. Appl..

[37]  Claudia Plant,et al.  Parameter Free Mixed-Type Density-Based Clustering , 2018, DEXA.

[38]  Chung-Chian Hsu,et al.  Incremental clustering of mixed data based on distance hierarchy , 2008, Expert Syst. Appl..

[39]  Xiaodong Liu,et al.  A spectral clustering method with semantic interpretation based on axiomatic fuzzy set theory , 2018, Appl. Soft Comput..

[40]  Qing Yang,et al.  A novel DBSCAN with entropy and probability for mixed data , 2017, Cluster Computing.

[41]  William M. Rand,et al.  Objective Criteria for the Evaluation of Clustering Methods , 1971 .

[42]  Iman Gholampour,et al.  Cluster-based sparse topical coding for topic mining and document clustering , 2018, Adv. Data Anal. Classif..

[43]  Kaspar Althoefer,et al.  Knock-Knock: Acoustic object recognition by using stacked denoising autoencoders , 2017, Neurocomputing.

[44]  Pascal Vincent,et al.  Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion , 2010, J. Mach. Learn. Res..

[45]  Nitish Srivastava,et al.  Dropout: a simple way to prevent neural networks from overfitting , 2014, J. Mach. Learn. Res..

[46]  Zengyou He,et al.  Scalable algorithms for clustering large datasets with mixed type attributes , 2005, Int. J. Intell. Syst..

[47]  Gil David,et al.  SpectralCAT: Categorical spectral clustering of numerical and nominal data , 2012, Pattern Recognit..

[48]  Ali Farhadi,et al.  Unsupervised Deep Embedding for Clustering Analysis , 2015, ICML.

[49]  Max Welling,et al.  Auto-Encoding Variational Bayes , 2013, ICLR.

[50]  Donald C. Wunsch,et al.  Clustering Data of Mixed Categorical and Numerical Type With Unsupervised Feature Learning , 2015, IEEE Access.

[51]  Yu Xue,et al.  A novel density peaks clustering algorithm for mixed data , 2017, Pattern Recognit. Lett..

[52]  Yoshua Bengio,et al.  Greedy Layer-Wise Training of Deep Networks , 2006, NIPS.

[53]  Pieter Abbeel,et al.  InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets , 2016, NIPS.

[54]  Sean Hughes,et al.  Clustering by Fast Search and Find of Density Peaks , 2016 .

[55]  Yoshua Bengio,et al.  Generative Adversarial Nets , 2014, NIPS.

[56]  A. Hoffman,et al.  Lower bounds for the partitioning of graphs , 1973 .