Minimum Information Loss Cluster Analysis for Categorical Data

The EM algorithm has been used repeatedly to identify latent classes in categorical data by estimating finite distribution mixtures of product components. Unfortunately, the underlying mixtures are not uniquely identifiable and, moreover, the estimated mixture parameters depend on the starting point. For this reason we use the latent class model only to define a set of "elementary" classes by estimating a mixture with a large number of components. We propose a hierarchical "bottom-up" cluster analysis based on sequentially merging the elementary latent classes. The clustering procedure is controlled by a minimum information loss criterion.
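
The merging step can be illustrated with a minimal sketch. The abstract does not define the information loss criterion, so the version below is an assumption: information is taken as the mutual information between data and class label, estimated from the posterior class probabilities of each data vector, and at every step the pair of classes whose union reduces it least is merged. The posterior matrix is assumed to come from a previously fitted mixture (for example by EM), and all function names are hypothetical.

```python
import math
from itertools import combinations


def xlogx(p):
    """Return p * log(p), with the usual convention 0 * log(0) = 0."""
    return p * math.log(p) if p > 0.0 else 0.0


def merge_term(a, b):
    """Entropy removed when two probabilities a and b are pooled into a + b."""
    return xlogx(a + b) - xlogx(a) - xlogx(b)


def information_loss(post, i, j):
    """Decrease of the estimated mutual information I(X; class) caused by
    merging classes i and j (ASSUMED criterion, not taken from the paper).

    post[n][m] is the posterior probability of class m given data vector n.
    """
    n = len(post)
    w_i = sum(row[i] for row in post) / n  # estimated weight of class i
    w_j = sum(row[j] for row in post) / n  # estimated weight of class j
    prior_part = merge_term(w_i, w_j)
    posterior_part = sum(merge_term(row[i], row[j]) for row in post) / n
    return prior_part - posterior_part


def merge_classes(post, n_clusters):
    """Greedy bottom-up merging of elementary classes.

    Returns the final groups (lists of elementary class indices) and the
    merge history as (group_a, group_b, loss) tuples.
    """
    post = [list(row) for row in post]  # work on a copy
    groups = [[m] for m in range(len(post[0]))]
    history = []
    while len(groups) > n_clusters:
        # Pick the pair of current classes with the smallest information loss.
        loss, i, j = min(
            (information_loss(post, i, j), i, j)
            for i, j in combinations(range(len(groups)), 2)
        )
        history.append((groups[i], groups[j], loss))
        # Pool the posteriors of the merged pair into column i, drop column j.
        for row in post:
            row[i] += row[j]
            del row[j]
        groups[i] = groups[i] + groups[j]
        del groups[j]
    return groups, history


# Toy posteriors: classes 0 and 1 overlap heavily, class 2 is well separated.
demo_post = [
    [0.5, 0.5, 0.0],
    [0.6, 0.4, 0.0],
    [0.4, 0.6, 0.0],
    [0.0, 0.0, 1.0],
    [0.0, 0.0, 1.0],
]
demo_groups, demo_history = merge_classes(demo_post, 2)
print(demo_groups)
```

On the toy posteriors the two overlapping classes are merged first, since pooling them discards almost no information about which class a data vector belongs to, while merging either of them with the separated class would be far more costly. The exhaustive pair search is quadratic in the number of classes per step, which is acceptable for a sketch but may need refinement for a large number of elementary components.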
