Efficient computation of normalized maximum likelihood coding for Gaussian mixtures with its applications to optimal clustering

This paper addresses the problem of estimating the number of mixture components of a Gaussian mixture model from a given data sequence. Our approach is to compute the normalized maximum likelihood (NML) code-length of the data sequence relative to a Gaussian mixture model and then to select the mixture size that minimizes this code-length. Minimizing the NML code-length is known as Rissanen's minimum description length (MDL) principle. For discrete domains, Kontkanen and Myllymäki proposed a method for efficiently computing the NML code-length for specific models; for continuous domains, however, how to compute the NML code-length efficiently has remained an open problem. We propose a method for efficiently computing the NML code-length for Gaussian mixture models. We develop it by approximating the NML code-length under a restriction of the domain and by using the technique of generating functions. We apply the method to the problem of determining the optimal number of clusters in clustering with a Gaussian mixture model, where the mixture size is the number of clusters. Using artificial and benchmark data sets, we empirically demonstrate that our estimate of the mixture size converges to the true one significantly faster than the estimates obtained with AIC and BIC.
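
For concreteness, the selection rule described above can be written in generic notation. The symbols below ($x^n$ for the data sequence, $K$ for the mixture size, $\hat{\theta}$ for the maximum likelihood estimator, $\mathcal{Y}^n$ for the set of sequences over which the normalizer is taken) are assumptions made for illustration and need not match the paper's own notation. The sketch shows only the standard form of the NML code-length, not the approximation proposed here.

```latex
% NML code-length of x^n relative to a K-component model:
% negative maximized log-likelihood plus the log of the normalizer.
\[
  L_{\mathrm{NML}}(x^n; K)
    = -\log f\bigl(x^n; \hat{\theta}(x^n), K\bigr)
      + \log \int_{\mathcal{Y}^n} f\bigl(y^n; \hat{\theta}(y^n), K\bigr)\, dy^n
\]
% Mixture-size estimate: the K with the shortest code-length.
\[
  \hat{K} = \operatorname*{arg\,min}_{K} \; L_{\mathrm{NML}}(x^n; K)
\]
```

The second term is the quantity that is expensive to evaluate. For Gaussian models the integral taken over all real-valued sequences diverges, so some restriction of $\mathcal{Y}^n$ is needed for the normalizer to be finite.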