Fast progressive training of mixture models for model selection

Finite mixture models (FMMs) are flexible models with uses as varied as density estimation, clustering, classification, modelling heterogeneity, model averaging, and handling missing data. The expectation maximization (EM) algorithm can learn maximum likelihood estimates of the model parameters, but one of its prerequisites is a priori knowledge of the number of mixture components. However, the number of components is often unknown, and determining it has been a central problem in mixture modelling. Mixture modelling is therefore often a two-stage process: first determining the number of mixture components and then estimating the parameters of the mixture model. This paper proposes fast training of a series of mixture models using progressive merging of mixture components, so that a model selection algorithm can make an appropriate choice of model. The paper also proposes a fast, data-driven approximation of the Kullback–Leibler (KL) divergence as a criterion for measuring the similarity of mixture components. We apply the proposed methodology to mixture modelling of a synthetic dataset, the publicly available zoo dataset, and two chromosomal aberration datasets, showing that model selection is efficient and effective.
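To illustrate the progressive-merging idea, the Python sketch below assumes a Bernoulli mixture over 0-1 data (as in the chromosomal aberration experiments) and merges the pair of components that are closest under a symmetric KL divergence. The closed-form product-Bernoulli KL used here is a simple stand-in for the paper's data-driven KL approximation, and all function names are illustrative rather than taken from the paper.

```python
import numpy as np

def kl_bernoulli(p, q, eps=1e-10):
    """KL divergence between two product-Bernoulli components,
    parameterised by success-probability vectors p and q."""
    p = np.clip(p, eps, 1 - eps)
    q = np.clip(q, eps, 1 - eps)
    return np.sum(p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q)))

def merge_closest_pair(weights, thetas):
    """One progressive-merge step: find the two components most similar
    under symmetric KL and replace them by their weighted average."""
    K = len(weights)
    best, pair = np.inf, (0, 1)
    for i in range(K):
        for j in range(i + 1, K):
            d = kl_bernoulli(thetas[i], thetas[j]) + kl_bernoulli(thetas[j], thetas[i])
            if d < best:
                best, pair = d, (i, j)
    i, j = pair
    w = weights[i] + weights[j]
    theta = (weights[i] * thetas[i] + weights[j] * thetas[j]) / w
    keep = [k for k in range(K) if k not in pair]
    return np.append(weights[keep], w), np.vstack([thetas[keep], theta])

if __name__ == "__main__":
    # Toy demo: start from 5 components and merge down to 2.
    rng = np.random.default_rng(0)
    weights = np.full(5, 0.2)
    thetas = rng.uniform(0.05, 0.95, size=(5, 10))
    while len(weights) > 2:
        weights, thetas = merge_closest_pair(weights, thetas)
        print(len(weights), "components, weights:", np.round(weights, 3))
```

Starting from a mixture with many components and applying such a merge step repeatedly, with a few EM iterations after each merge, yields the series of candidate models among which model selection is then performed.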
