Iterative factor clustering of binary data

Binary data are a special case in which both distance measures and co-occurrence measures can be used. Non-hierarchical methods based on Euclidean distance, such as the k-means algorithm or one of its variants, can be applied profitably. As the number of available attributes increases, however, overall clustering performance usually worsens. In such cases, enhancing group separability requires removing irrelevant, redundant, and noisy information from the data. The present approach belongs to the category of attribute transformation strategies: it combines clustering and factorial techniques to identify attribute associations that characterize one or more homogeneous groups of statistical units. It also provides graphical representations that facilitate the interpretation of the results.
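
The idea of combining a clustering step with a factorial step can be illustrated with a minimal sketch. It is not the procedure of the paper, whose details are not given in this abstract. It assumes one plausible scheme: alternate between k-means and a projection onto the axes that best separate the current group centroids. The function names, the size-weighted SVD of the centroid table, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np

def kmeans(X, centers, n_iter=100):
    """Plain Lloyd k-means started from the given centers."""
    for _ in range(n_iter):
        # squared Euclidean distance of every unit to every center
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(len(centers))
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels

def iterative_factor_clustering(Z, k, n_axes=None, n_rounds=10, seed=0):
    """Hypothetical alternation of a factorial step and a k-means step."""
    rng = np.random.default_rng(seed)
    n_axes = k - 1 if n_axes is None else n_axes
    Zc = Z - Z.mean(axis=0)  # center the binary attributes
    # start from k-means on all attributes, with random units as centers
    labels = kmeans(Zc, Zc[rng.choice(len(Zc), size=k, replace=False)])
    for _ in range(n_rounds):
        # factorial step: SVD of the size-weighted group centroid table
        G = np.array([
            Zc[labels == j].mean(axis=0) if np.any(labels == j)
            else np.zeros(Zc.shape[1])
            for j in range(k)
        ])
        sizes = np.bincount(labels, minlength=k)
        _, _, Vt = np.linalg.svd(np.sqrt(sizes)[:, None] * G, full_matrices=False)
        axes = Vt[:n_axes]       # attribute loadings on the factorial axes
        scores = Zc @ axes.T     # unit coordinates in the reduced space
        # clustering step: k-means in the reduced space, started from the
        # projected centroids so that group labels stay comparable
        new_labels = kmeans(scores, G @ axes.T)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels, scores, axes

# Synthetic example (assumed): 6 informative attributes and 20 noise attributes.
rng = np.random.default_rng(1)
truth = np.repeat([0, 1], 100)
probs = np.where(truth[:, None] == 0, [0.9] * 3 + [0.1] * 3, [0.1] * 3 + [0.9] * 3)
informative = rng.random((200, 6)) < probs
noise = rng.random((200, 20)) < 0.5
Z = np.hstack([informative, noise]).astype(float)

labels, scores, axes = iterative_factor_clustering(Z, k=2)
```

In this sketch the rows of `axes` indicate which attributes drive the group separation, and plotting `scores` colored by `labels` gives a simple graphical view of the partition.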
