Model-based cluster analysis

Abstract The problem of dot clustering is studied from a model-based viewpoint. A set of “placement” processes is chosen, each of which assigns a probability to every location in a discrete space; in other words, a placement is a probability mass function (pmf) on the space. A number of dots is then distributed according to each of these pmfs; a pmf together with its associated cardinality defines a subpopulation of dots. The model is extremely general because the pmfs are arbitrary. Given a set of dots generated by such a model, maximum a posteriori (MAP) methods are applied to recover the most likely set of placements and cardinalities that could have given rise to the dots. This identification problem differs from the partitioning problem, which asks for the most likely partition of the dot population into subpopulations. It is shown how and why MAP methods are useful in cluster analysis, especially when the placement pmfs are non-Gaussian. It is also shown that, although the general identification problem is intractable, it has a polynomial-time solution when the number of subpopulations is bounded. A similar result is shown to hold for the partitioning problem.
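The generative model described in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: the function names `sample_dots` and `partition_log_likelihood`, the dictionary representation of a pmf, and the assumption that dots are placed independently are all illustrative choices made here. The second function only scores a given labelled partition under known placements; it does not perform the MAP search over placements and cardinalities.

```python
import math
import random

def sample_dots(placements, cardinalities, seed=0):
    """Draw dots from each placement pmf (hypothetical helper).

    placements: list of dicts mapping location -> probability (each sums to 1).
    cardinalities: number of dots to draw from each placement.
    Returns a list of (location, subpopulation_label) pairs.
    """
    rng = random.Random(seed)
    dots = []
    for label, (pmf, n) in enumerate(zip(placements, cardinalities)):
        locations = list(pmf.keys())
        weights = list(pmf.values())
        # Assumption: dots within a subpopulation are placed independently.
        for loc in rng.choices(locations, weights=weights, k=n):
            dots.append((loc, label))
    return dots

def partition_log_likelihood(dots, placements):
    """Log-probability of the labelled dots under the given placements."""
    total = 0.0
    for loc, label in dots:
        p = placements[label].get(loc, 0.0)
        if p == 0.0:
            return float("-inf")  # dot lies outside the placement's support
        total += math.log(p)
    return total

# Two arbitrary (non-Gaussian) pmfs on a four-cell discrete space.
placements = [{0: 0.5, 1: 0.5}, {2: 0.25, 3: 0.75}]
cardinalities = [3, 2]
dots = sample_dots(placements, cardinalities)
print(dots)
print(partition_log_likelihood(dots, placements))
```

Because the two example pmfs have disjoint supports, swapping the labels gives a log-likelihood of negative infinity, which shows in a small way how a candidate partition can be ranked against the placements.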
