Clustering of Heterogeneously Typed Data with Soft Computing

The problem of finding clusters in arbitrary data sets has been approached in many ways. Most approaches assume a metric for judging the adequacy of the clusters; that is, the criterion that measures cluster quality depends on the distances between the elements of each cluster. Typically, a cluster is considered adequately characterized if its elements are close to one another and, simultaneously, far from those of other clusters. This intuitive approach fails if the variables of the elements are not amenable to distance measurements, i.e., if the vectors of such elements cannot be quantified. This case arises frequently in real-world applications, where several variables (if not most of them) are categorical. The usual tendency is to encode the categories by assigning an arbitrary number to each. This, however, may result in spurious patterns: relationships between the variables that were not there at the outset. Evidently, no assignment can ensure a universally valid numerical value for variables of this kind. There is, however, a strategy which guarantees that the encoding will, in general, not bias the results. In this paper we explore such a strategy. We discuss the theoretical foundations of our approach and prove that it is the best strategy in terms of the statistical behavior of the sampled data. We also show that, when applied to a complex real-world problem, it allows us to generalize soft computing methods to find the number and characteristics of a set of clusters. We contrast the characteristics of the clusters obtained by the automated method with those identified by the experts.
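
The concern about arbitrary encodings can be illustrated with a minimal sketch. It is not the strategy explored in the paper; the category names, the two codebooks, and the helper functions are hypothetical and chosen only for illustration. It shows that which category appears "closest" to another depends entirely on the arbitrary integer codes assigned.

```python
# Illustrative only: two arbitrary integer codings of the same categorical
# variable (hypothetical categories "red", "green", "blue").

def code_distance(codebook, a, b):
    """Absolute difference between the integer codes of categories a and b."""
    return abs(codebook[a] - codebook[b])

def nearest_category(codebook, target):
    """Category whose code is closest to target's code (ties broken by name)."""
    others = [c for c in codebook if c != target]
    return min(others, key=lambda c: (code_distance(codebook, target, c), c))

# Same categories, two equally arbitrary assignments of numbers.
coding_a = {"red": 0, "green": 1, "blue": 2}
coding_b = {"red": 0, "blue": 1, "green": 2}

nearest_a = nearest_category(coding_a, "red")
nearest_b = nearest_category(coding_b, "red")

# The "neighbor" of red changes with the coding, although the data did not.
print(nearest_a, nearest_b)
```

Any distance-based clustering criterion applied to such codes inherits this dependence, which is the kind of spurious pattern the abstract warns about.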
