Fast, Large-Scale Transformation-Invariant Clustering

In previous work on "transformed mixtures of Gaussians" and "transformed hidden Markov models", we showed how the EM algorithm in a discrete latent variable model can be used to jointly normalize data (e.g., center images, pitch-normalize spectrograms) and learn a mixture model of the normalized data. The only input to the algorithm is the data, a list of possible transformations, and the number of clusters to find. The main criticism of this work was that the exhaustive computation of the posterior probabilities over transformations would make scaling up to large feature vectors and large sets of transformations intractable. Here, we describe how a tremendous speed-up is acheived through the use of a variational technique for decoupling transformations, and a fast Fourier transform method for computing posterior probabilities. For N × N images, learning C clusters under N rotations, N scales, N x-translations and N y-translations takes only (C + 2 log N)N2 scalar operations per iteration. In contrast, the original algorithm takes CN6 operations to account for these transformations. We give results on learning a 4-component mixture model from a video sequence with frames of size 320×240. The model accounts for 360 rotations and 76,800 translations. Each iteration of EM takes only 10 seconds per frame in MATLAB, which is over 5 million times faster than the original algorithm.