An Ordinal Data Clustering Algorithm with Automated Distance Learning

Clustering ordinal data is a common task in data mining and machine learning fields. As a major type of categorical data, ordinal data is composed of attributes with naturally ordered possible values (also called categories interchangeably in this paper). However, due to the lack of dedicated distance metric, ordinal categories are usually treated as nominal ones, or coded as consecutive integers and treated as numerical ones. Both these two common ways will roughly define the distances between ordinal categories because the former way ignores the order relationship and the latter way simply assigns identical distances to different pairs of adjacent categories that may have intrinsically unequal distances. As a result, they may produce unsatisfactory ordinal data clustering results. This paper, therefore, proposes a novel ordinal data clustering algorithm, which iteratively learns: 1) The partition of ordinal dataset, and 2) the inter-category distances. To the best of our knowledge, this is the first attempt to dynamically adjust inter-category distances during the clustering process to search for a better partition of ordinal data. The proposed algorithm features superior clustering accuracy, low time complexity, fast convergence, and is parameter-free. Extensive experiments show its efficacy.

[1]  Yong-Yeol Ahn,et al.  The Impact of Random Models on Clustering Similarity , 2017, bioRxiv.

[2]  Michael K. Ng,et al.  An optimization algorithm for clustering using weighted dissimilarity measures , 2004, Pattern Recognit..

[3]  Hong Jia,et al.  Categorical-and-numerical-attribute data clustering based on a unified similarity metric without knowing cluster number , 2013, Pattern Recognit..

[4]  Hong Jia,et al.  Subspace Clustering of Categorical and Numerical Data With an Unknown Number of Clusters , 2018, IEEE Transactions on Neural Networks and Learning Systems.

[5]  A. Agresti An introduction to categorical data analysis , 1997 .

[6]  Deng Cai,et al.  Laplacian Score for Feature Selection , 2005, NIPS.

[7]  Tu Bao Ho,et al.  for categorical data , 2005 .

[8]  Tao Li,et al.  Entropy-based criterion in categorical clustering , 2004, ICML.

[9]  Yiu-ming Cheung,et al.  Exploiting Order Information Embedded in Ordered Categories for Ordinal Data Clustering , 2018, ISMIS.

[10]  Hong Jia,et al.  A New Distance Metric for Unsupervised Learning of Categorical Data , 2014, IEEE Transactions on Neural Networks and Learning Systems.

[11]  Michael K. Ng,et al.  On the Impact of Dissimilarity Measure in k-Modes Clustering Algorithm , 2007, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[12]  Lipika Dey,et al.  A method to compute distance between two categorical values of same attribute in unsupervised learning for categorical data set , 2007, Pattern Recognit. Lett..

[13]  Joshua Zhexue Huang,et al.  Extensions to the k-Means Algorithm for Clustering Large Data Sets with Categorical Values , 1998, Data Mining and Knowledge Discovery.

[14]  Yi Li,et al.  COOLCAT: an entropy-based algorithm for categorical clustering , 2002, CIKM '02.

[15]  J. MacQueen Some methods for classification and analysis of multivariate observations , 1967 .

[16]  Joydeep Ghosh,et al.  Cluster Ensembles --- A Knowledge Reuse Framework for Combining Multiple Partitions , 2002, J. Mach. Learn. Res..

[17]  อนิรุธ สืบสิงห์,et al.  Data Mining Practical Machine Learning Tools and Techniques , 2014 .

[18]  William M. Rand,et al.  Objective Criteria for the Evaluation of Clustering Methods , 1971 .

[19]  Michael K. Ng,et al.  Automated variable weighting in k-means type clustering , 2005, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[20]  Ruggero G. Pensa,et al.  From Context to Distance: Learning Dissimilarity for Categorical Data Clustering , 2012, TKDD.