Bayesian unsupervised classification framework based on stochastic partitions of data and a parallel search strategy

The advantages of statistical model-based unsupervised classification over heuristic alternatives have been widely demonstrated in the scientific literature. However, existing model-based approaches are often both conceptually and numerically unstable for large and complex data sets. Here we consider a Bayesian model-based method for unsupervised classification of discrete-valued vectors that has certain advantages over standard solutions based on latent class models. Our theoretical formulation defines a posterior probability measure on the space of classification solutions corresponding to stochastic partitions of the observed data. To explore the classification space efficiently, we use a parallel search strategy based on non-reversible stochastic processes. A decision-theoretic approach is used to formalize the inferential process in the context of unsupervised classification. Both real and simulated data sets are used to illustrate the methods discussed.
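
To make the idea of a posterior measure over partitions concrete, the following minimal sketch scores a candidate partition of discrete-valued vectors by its log marginal likelihood. It is an illustration under stated assumptions, not the model of the paper: the features are taken to be independent within each class, each with a symmetric Dirichlet prior (the parameter `alpha`), and the prior over partitions is taken to be uniform, so that the score is proportional to the log posterior up to a constant. The function names `log_marginal_cluster` and `log_partition_score` are hypothetical, and the parallel non-reversible search itself is not shown.

```python
from collections import Counter
from math import lgamma


def log_marginal_cluster(rows, n_levels, alpha=1.0):
    """Log marginal likelihood of one class of discrete vectors.

    Assumption (not from the source): features are independent within a
    class, and feature j has a symmetric Dirichlet(alpha) prior over its
    n_levels[j] possible values, coded 0 .. n_levels[j] - 1.
    """
    n = len(rows)
    total = 0.0
    for j, r in enumerate(n_levels):
        counts = Counter(row[j] for row in rows)
        # Dirichlet-multinomial normalising term for feature j.
        total += lgamma(r * alpha) - lgamma(r * alpha + n)
        for level in range(r):
            total += lgamma(alpha + counts.get(level, 0)) - lgamma(alpha)
    return total


def log_partition_score(data, labels, n_levels, alpha=1.0):
    """Sum of per-class log marginal likelihoods for a labelled partition.

    Under the assumed uniform prior over partitions, this equals the log
    posterior of the partition up to an additive constant.
    """
    clusters = {}
    for row, label in zip(data, labels):
        clusters.setdefault(label, []).append(row)
    return sum(
        log_marginal_cluster(rows, n_levels, alpha)
        for rows in clusters.values()
    )


if __name__ == "__main__":
    # Toy data: four binary vectors forming two obvious groups.
    data = [(0, 0), (0, 0), (1, 1), (1, 1)]
    n_levels = (2, 2)
    for labels in ([0, 0, 1, 1], [0, 0, 0, 0], [0, 1, 0, 1]):
        print(labels, log_partition_score(data, labels, n_levels))
```

On the toy data, the partition that groups identical vectors together receives the highest score, the single-class solution comes second, and the partition that mixes the two groups scores lowest. A stochastic search over partitions would compare such scores between the current and proposed solutions.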
