A New Context-Based Clustering Framework for Categorical Data

Clustering is a fundamental task used in many scientific fields, especially machine learning and data mining. In clustering, dissimilarity measures play a key role in forming clusters. For categorical values, dissimilarity is usually quantified with the simple matching method, which only checks whether two values are identical. This method therefore cannot capture the hidden semantic information that can be inferred from relationships among categories. In this paper, we propose a new clustering framework for categorical data that integrates both the distributions of categories and the relationships among them into the pattern proximity evaluation step of the clustering task. The effectiveness of the proposed clustering algorithm is demonstrated through a comparative study against existing clustering methods for categorical data.
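To make the limitation of simple matching concrete, the following minimal sketch counts the attributes on which two categorical records disagree. It illustrates only the baseline measure, not the proposed framework; the function name `simple_matching` and the example records are assumptions introduced here for illustration.

```python
from typing import Sequence


def simple_matching(x: Sequence[str], y: Sequence[str]) -> int:
    """Count the attributes on which two categorical records disagree."""
    if len(x) != len(y):
        raise ValueError("records must have the same number of attributes")
    # Each attribute contributes 0 if the values are identical, 1 otherwise.
    return sum(a != b for a, b in zip(x, y))


# Hypothetical records with two attributes: (color, size).
r1 = ("red", "small")
r2 = ("crimson", "small")
r3 = ("blue", "small")

d12 = simple_matching(r1, r2)  # differs from r1 only in color
d13 = simple_matching(r1, r3)  # also differs from r1 only in color
```

Both `d12` and `d13` come out the same, because simple matching treats every pair of distinct categories as equally different. Any notion that "crimson" is closer to "red" than "blue" is would have to come from additional information, such as how the categories co-occur with the values of other attributes.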
