Learning Mixtures by Simplifying Kernel Density Estimators

Gaussian mixture models are a widespread tool for modeling complex probability density functions. They can be estimated by various means, most often Expectation–Maximization or Kernel Density Estimation. Beyond these well-known algorithms, newer and promising stochastic modeling methods include Dirichlet Process mixtures and k-Maximum Likelihood Estimators. Most of these methods, including Expectation–Maximization, lead to compact models but may be expensive to compute, whereas Kernel Density Estimation yields large models that are computationally cheap to build. In this chapter we present new methods to obtain high-quality models that are both compact and fast to compute, by simplifying a Kernel Density Estimator. The simplification is a clustering method based on k-means-like algorithms; like all k-means algorithms, it relies on divergences and centroid computations, and we use two different divergences (and their associated centroids): Bregman and Fisher-Rao. Along with the description of the algorithms, we present the pyMEF library, a Python library designed for the manipulation of mixtures of exponential families. Unlike most other existing tools, this library is not limited to a particular distribution but supports any exponential family, which makes it possible to rapidly explore the available exponential families and choose the one best suited to a particular application. We evaluate the proposed algorithms by building mixture models on examples from a bioinformatics application. The quality of the resulting models is measured in terms of log-likelihood and Kullback–Leibler divergence.
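To make the idea concrete, below is a minimal Python sketch (not the authors' pyMEF implementation; the names kl_gauss and simplify_kde and all parameters are hypothetical) of a k-means-like KL/Bregman clustering that collapses a univariate Gaussian KDE into k components. It relies on two standard facts: the KL divergence between members of the same exponential family is a Bregman divergence on their parameters, and the KL centroid within an exponential family (argmin_q of the weighted sum of KL(p_i || q)) is obtained by moment matching. The Fisher-Rao variant is not shown.

import numpy as np

def kl_gauss(m1, v1, m2, v2):
    # KL(N(m1, v1) || N(m2, v2)) for univariate Gaussians (v = variance).
    return 0.5 * (np.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0)

def simplify_kde(samples, k, bandwidth, n_iter=50, seed=0):
    # Collapse the n-component KDE (one Gaussian per sample) into k Gaussians.
    rng = np.random.default_rng(seed)
    mus = np.asarray(samples, dtype=float)          # KDE component means
    vars_ = np.full_like(mus, bandwidth ** 2)       # KDE component variances
    w = np.full_like(mus, 1.0 / len(mus))           # uniform KDE weights

    # Initialise the cluster centres on k randomly chosen KDE components.
    idx = rng.choice(len(mus), size=k, replace=False)
    c_mu, c_var = mus[idx].copy(), vars_[idx].copy()

    for _ in range(n_iter):
        # Assignment step: nearest centre in KL "distance".
        d = kl_gauss(mus[:, None], vars_[:, None], c_mu[None, :], c_var[None, :])
        labels = d.argmin(axis=1)
        # Centroid step: moment matching gives the KL centroid in the family.
        for j in range(k):
            mask = labels == j
            if not mask.any():
                continue
            wj = w[mask] / w[mask].sum()
            c_mu[j] = np.sum(wj * mus[mask])
            second_moment = np.sum(wj * (vars_[mask] + mus[mask] ** 2))
            c_var[j] = second_moment - c_mu[j] ** 2

    c_w = np.array([w[labels == j].sum() for j in range(k)])
    return c_w, c_mu, c_var

# Usage: simplify a 1000-component KDE of a bimodal sample into 2 Gaussians.
data = np.concatenate([np.random.normal(-2, 0.5, 500), np.random.normal(3, 1.0, 500)])
weights, means, variances = simplify_kde(data, k=2, bandwidth=0.3)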
