Principal Component Analysis

Principal components analysis (PCA) is a dimensionality reduction technique that gives a compact representation of data while minimising information loss. Suppose we are given a set of data, represented as vectors in a high-dimensional space. It may be that many of the variables are correlated and that the data closely fits a lower dimensional linear manifold. In this case, PCA finds such a lower dimensional representation in terms of uncorrelated variables called principal components. PCA can also be kernelised, allowing it to be used to fit data to low-dimensional non-linear manifolds. Besides dimensionality reduction, PCA can also uncover patterns in data and lead to a potentially less noisy and more informative representation. Often one applies PCA to prepare data for further analysis, e.g., finding nearest neighbours or clustering.

In a nutshell, PCA proceeds as follows. We are given a collection of data in the form of $n$ vectors $x_1, \ldots, x_n \in \mathbb{R}^m$. By first translating the data vectors, if necessary, we may assume that the input data are mean centred, that is, $\sum_{i=1}^{n} x_i = 0$. Given a target number of dimensions $k \le m$, PCA aims to find an orthonormal family of $k$ vectors $u_1, \ldots, u_k \in \mathbb{R}^m$ that "explain most of the variation in the data". More precisely, for $i = 1, \ldots, n$ we approximate each data point $x_i$ by a linear expression $z_{i1} u_1 + \cdots + z_{ik} u_k$ for some scalars $z_{i1}, \ldots, z_{ik} \in \mathbb{R}$; the goal of PCA is to choose $u_1, \ldots, u_k$ so as to optimise the quality of this approximation over all data points, as made precise below. The optimal such vectors $u_1, \ldots, u_k$ are the $k$ principal components: $u_1$ is the direction of greatest variance in the data, $u_2$ is the direction of greatest variance orthogonal to $u_1$, and so on. To find the principal components we apply a matrix factorisation technique, the singular value decomposition, to the $m \times n$ matrix whose columns are the mean-centred data points $x_i$. In the end, representing each data point $x_i \in \mathbb{R}^m$ by its coordinates $z_i = (z_{i1}, \ldots, z_{ik})$ with respect to the $k$ principal components yields a lower dimensional and (hopefully) more informative representation.
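To make "optimise the quality of this approximation" concrete, the standard formulation (assumed here, since the passage above does not spell it out) measures approximation quality by the total squared Euclidean error:
\[
\min_{\substack{u_1, \ldots, u_k \text{ orthonormal} \\ z_{ij} \in \mathbb{R}}} \; \sum_{i=1}^{n} \Bigl\| x_i - \sum_{j=1}^{k} z_{ij} u_j \Bigr\|^2 .
\]
For any fixed orthonormal family $u_1, \ldots, u_k$, the error-minimising coefficients are the orthogonal projections $z_{ij} = u_j^\top x_i$, so the problem reduces to choosing the best $k$-dimensional subspace; for mean-centred data, minimising this reconstruction error is equivalent to maximising the variance captured by the projection.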
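As a concrete illustration, the following is a minimal NumPy sketch of this pipeline via the singular value decomposition. The function name and the one-point-per-row layout (the transpose of the $m \times n$ "columns are data points" convention used above) are choices of this example, not mandated by the text.

```python
import numpy as np

def pca(X, k):
    """Coordinates of the rows of X w.r.t. the top-k principal components.

    X : (n, m) array, one data point per row (transpose of the m x n
        "columns are data points" convention used in the text).
    k : target number of dimensions, k <= m.
    Returns Z, an (n, k) array of coordinates z_i, and U, an (m, k)
    array whose columns are the principal components u_1, ..., u_k.
    """
    Xc = X - X.mean(axis=0)   # mean-centre so the points sum to zero
    # SVD of the centred data matrix; the rows of Vt are orthonormal
    # and are ordered by decreasing singular value, i.e. by decreasing
    # variance of the data along that direction.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    U = Vt[:k].T              # the k principal components u_1, ..., u_k
    Z = Xc @ U                # coordinates z_ij = u_j^T x_i
    return Z, U

# Example: 200 noisy points lying near a 2-dimensional subspace of R^5.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) @ rng.normal(size=(2, 5)) \
    + 0.01 * rng.normal(size=(200, 5))
Z, U = pca(X, k=2)            # Z has shape (200, 2), U has shape (5, 2)
```

Because the rows of `Vt` come out sorted by decreasing singular value, taking the first $k$ of them recovers exactly the top-$k$ components; in practice one would typically reach for a library routine such as `sklearn.decomposition.PCA` rather than this hand-rolled version.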