Top-Down Parameter-Free Clustering of High-Dimensional Categorical Data

A parameter-free, fully automatic approach to clustering high-dimensional categorical data is proposed. The technique is a two-phase iterative procedure that attempts to improve the overall quality of the partition. In the first phase, cluster assignments are held fixed, and a new cluster is added to the partition by identifying and splitting a low-quality cluster. In the second phase, the number of clusters is held fixed, and cluster assignments are optimized. As a result, the number of clusters is not specified in advance but emerges from the inherent features of the underlying data set. Furthermore, the approach is parametric with respect to the notion of cluster quality: here, a cluster is defined as a set of tuples exhibiting some form of homogeneity. We show how a suitable notion of cluster homogeneity can be defined in the context of high-dimensional categorical data, from which an effective instance of the proposed clustering scheme immediately follows. Experiments on both synthetic and real data show that the devised algorithm scales linearly and achieves nearly optimal results in terms of compactness and separation.
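The two-phase scheme can be illustrated with a minimal Python sketch. It is not the authors' implementation: the homogeneity score in `quality`, the seeding heuristic in `split`, and all function names are assumptions made here for illustration. Tuples are modelled as frozensets of `attribute=value` items, and a split is kept only if the total quality of the partition strictly improves, so no cluster count is supplied.

```python
from collections import Counter

EPS = 1e-9  # tolerance so that only strict improvements are accepted


def quality(cluster):
    """Assumed homogeneity score: |C| * sum_a p_a^2, where p_a is the
    fraction of tuples in the cluster that contain item a."""
    if not cluster:
        return 0.0
    counts = Counter(item for t in cluster for item in t)
    return sum(c * c for c in counts.values()) / len(cluster)


def partition_quality(partition):
    return sum(quality(c) for c in partition)


def stabilize(clusters):
    """Phase 2: the number of clusters is fixed; move single tuples
    between clusters for as long as a move strictly improves quality."""
    clusters = [list(c) for c in clusters]
    moved = True
    while moved:
        moved = False
        for src in range(len(clusters)):
            for t in list(clusters[src]):
                if len(clusters[src]) <= 1:
                    break  # never empty a cluster
                rest = list(clusters[src])
                rest.remove(t)
                best_gain, best_dst = EPS, None
                for dst in range(len(clusters)):
                    if dst == src:
                        continue
                    before = quality(clusters[src]) + quality(clusters[dst])
                    after = quality(rest) + quality(clusters[dst] + [t])
                    if after - before > best_gain:
                        best_gain, best_dst = after - before, dst
                if best_dst is not None:
                    clusters[src].remove(t)
                    clusters[best_dst].append(t)
                    moved = True
    return clusters


def split(cluster):
    """Phase 1 helper (assumed heuristic): seed a new cluster with the
    tuple least similar to the first one, then refine the two halves."""
    if len(cluster) < 2:
        return None
    seed = cluster[0]
    far = min(range(1, len(cluster)), key=lambda i: len(seed & cluster[i]))
    left = [t for i, t in enumerate(cluster) if i != far]
    return stabilize([left, [cluster[far]]])


def cluster_top_down(data):
    """Alternate the two phases until no split improves the partition."""
    if not data:
        return []
    partition = [list(data)]
    while True:
        current = partition_quality(partition)
        # Try the least homogeneous clusters (per tuple) first.
        order = sorted(range(len(partition)),
                       key=lambda i: quality(partition[i]) / len(partition[i]))
        for i in order:
            halves = split(partition[i])
            if halves is None:
                continue
            candidate = stabilize(partition[:i] + partition[i + 1:] + halves)
            if partition_quality(candidate) > current + EPS:
                partition = candidate
                break
        else:
            return partition  # no split helped: stop


demo = ([frozenset({"color=red", "shape=round"})] * 3
        + [frozenset({"color=blue", "shape=square"})] * 3)
clusters = cluster_top_down(demo)
```

On the small `demo` data set the sketch separates the red/round tuples from the blue/square ones and then stops, because splitting either homogeneous cluster further does not raise the assumed score. A different quality function can be substituted in `quality` without changing the rest of the loop, which mirrors the abstract's remark that the scheme is parametric with respect to the notion of cluster quality.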
