Mixtures of Probabilistic Principal Component Analysers

Let us formalize PCA in the following way. Given observed points {t_n}, n ∈ {1, …, N}, of dimension d, PCA seeks a number q < d of orthonormal axes (thus spanning a linear subspace of dimension q) such that the variance of the projections of the observed vectors onto this subspace is maximal. The idea is that the directions along which the variance of the observed data is maximal are those which carry the most information about the individual observations, and should therefore be preserved in order to discriminate between observations. Conversely, the directions along which the observed variance is minimal give little information about the individual observations: all observed vectors are "roughly the same" along such a direction, so it describes the global structure of the data rather than any particular observation; this information can be stored once, in the form of the lower-dimensional linear subspace, and forgotten at the level of individual observations.
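To make this precise, the following standard formulation may be helpful (the symbols \bar{t}, S and W are introduced here for illustration and do not appear in the original text). Writing \bar{t} = \frac{1}{N}\sum_{n=1}^{N} t_n for the sample mean and S = \frac{1}{N}\sum_{n=1}^{N} (t_n - \bar{t})(t_n - \bar{t})^{\mathrm T} for the sample covariance matrix, PCA seeks a d \times q matrix W with orthonormal columns maximizing the total projected variance,

\max_{W \in \mathbb{R}^{d \times q},\; W^{\mathrm T} W = I_q} \; \operatorname{tr}\!\left( W^{\mathrm T} S W \right),

whose maximizer is (up to rotation within the subspace) the matrix whose columns are the q eigenvectors of S associated with its q largest eigenvalues. The remaining d - q directions, those of minimal variance, are the ones discarded when each observation is summarized by its q-dimensional projection W^{\mathrm T}(t_n - \bar{t}).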