Clustering Mixed-Type Data with Correlation-Preserving Embedding

Mixed-type data, which contains both categorical and numerical features, is prevalent in many real-world applications. Clustering such data is challenging, largely because of the complex relationships between categorical and numerical features. Widely adopted encoding methods and existing representation learning algorithms fail to capture these relationships. In this paper, we propose COPE, a correlation-preserving embedding framework that learns representations of the categorical features in mixed-type data while preserving the correlation between numerical and categorical features. Extensive experiments on real-world datasets show that COPE generates high-quality representations and outperforms state-of-the-art clustering algorithms by a wide margin.
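To make the notion of correlation between a categorical and a numerical feature concrete, the following minimal sketch computes the correlation ratio (eta squared), the share of a numerical feature's variance explained by a categorical grouping. This is an illustrative stand-in, not COPE itself and not necessarily the correlation measure COPE uses; the function name `correlation_ratio` and the toy features `plan`, `region`, and `spend` are hypothetical.

```python
from collections import defaultdict
from statistics import fmean


def correlation_ratio(categories, values):
    """Return eta squared: the fraction of variance in `values`
    explained by grouping the samples by `categories`."""
    groups = defaultdict(list)
    for category, value in zip(categories, values):
        groups[category].append(value)

    grand_mean = fmean(values)
    total_ss = sum((v - grand_mean) ** 2 for v in values)
    if total_ss == 0:
        return 0.0  # a constant numerical feature has no variance to explain

    # Between-group sum of squares, weighted by group size.
    between_ss = sum(
        len(group) * (fmean(group) - grand_mean) ** 2
        for group in groups.values()
    )
    return between_ss / total_ss


# Hypothetical toy data: `plan` is strongly tied to `spend`, `region` is not.
plan = ["basic", "basic", "premium", "premium", "basic", "premium"]
region = ["n", "s", "n", "s", "s", "n"]
spend = [10.0, 12.0, 30.0, 32.0, 11.0, 31.0]

eta_plan = correlation_ratio(plan, spend)
eta_region = correlation_ratio(region, spend)
print(f"plan vs spend:   {eta_plan:.3f}")
print(f"region vs spend: {eta_region:.3f}")
```

A one-hot encoding would place `plan` and `region` in the same kind of binary columns, even though only `plan` carries information about `spend`. A correlation-preserving embedding aims to retain this type of dependence in the learned representation.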
