Visualization and clustering of categorical data with probabilistic self-organizing map

This paper introduces a self-organizing map dedicated to the clustering, analysis and visualization of categorical data. Topological maps usually handle categorical data through an encoding stage: the categorical values are converted into numerical vectors, and a traditional numerical algorithm (SOM) is then run on them. In the present paper, we propose a novel probabilistic formalism of the Kohonen map dedicated to categorical data, in which neurons are represented by probability tables. No encoding of the variables is required. We evaluate the effectiveness of our model on four examples using real data. Our experiments show that the model yields good-quality results on categorical data.
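
To make the idea of a neuron represented by probability tables more concrete, here is a minimal sketch in Python. It is not the model of the paper: the toy variables, the table values, the assumption that variables are independent within a neuron, and the helper names `log_likelihood` and `best_neuron` are all hypothetical and chosen only for illustration. The sketch also leaves out the map topology and any learning of the tables. It only shows that a categorical observation can be scored against a neuron directly, without first being encoded as a numerical vector.

```python
import math

# Hypothetical toy map with two neurons. Each neuron holds one probability
# table per categorical variable (here: a colour and a size).
neurons = [
    [{"red": 0.8, "blue": 0.2}, {"small": 0.7, "large": 0.3}],
    [{"red": 0.1, "blue": 0.9}, {"small": 0.2, "large": 0.8}],
]

def log_likelihood(neuron, observation):
    """Log-probability of an observation under one neuron.

    Assumes (for illustration only) that the variables are independent
    given the neuron, so the table entries simply multiply.
    """
    return sum(math.log(table[value]) for table, value in zip(neuron, observation))

def best_neuron(neurons, observation):
    """Index of the neuron under which the observation is most probable."""
    return max(range(len(neurons)),
               key=lambda k: log_likelihood(neurons[k], observation))

# The raw category labels are used as they are; no numerical coding is applied.
print(best_neuron(neurons, ("red", "small")))
print(best_neuron(neurons, ("blue", "large")))
```

In this sketch an observation is assigned to the neuron that gives it the highest probability, which plays the role that the nearest prototype plays in a numerical SOM.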
