Grassmann clustering

An important tool in high-dimensional, explorative data mining is given by clustering methods. They aim at identifying samples or regions of similar characteristics, and often code them by a single codebook vector or centroid. One of the most commonly used partitional clustering techniques is the k-means algorithm, which in its batch form partitions the data set into k disjoint clusters by simply iterating between cluster assignments and cluster updates. The latter step implies calculating a new centroid within each cluster. We generalize the concept of k-means by applying it not to the standard Euclidean space but to the manifold of subvectorspaces of a fixed dimension, also known as the Grassmann manifold. Important examples include projective space i.e. the manifold of lines and the space of all hyperplanes. Detecting clusters in multiple samples drawn from a Grassmannian is a problem arising in various applications. In this manuscript, we provide corresponding metrics for a Grassmann k-means algorithm, and solve the centroid calculation problem explicitly in closed form. An application to nonnegative matrix factorization illustrates the feasibility of the proposed algorithm.