Principal components analysis (PCA) is a dimensionality-reduction technique that gives a compact representation of data while minimising information loss. Suppose we are given a set of data, represented as vectors in a high-dimensional space. It may be that many of the variables are correlated and that the data closely fit a lower-dimensional linear manifold. In this case, PCA finds such a lower-dimensional representation in terms of uncorrelated variables called principal components. PCA can also be kernelised, allowing it to fit data to low-dimensional non-linear manifolds. Besides dimensionality reduction, PCA can uncover patterns in data and yield a potentially less noisy and more informative representation. Often one applies PCA to prepare data for further analysis, e.g., finding nearest neighbours or clustering.

In a nutshell, PCA proceeds as follows. We are given a collection of data in the form of n vectors x_1, …, x_n ∈ R^m. By first translating the data vectors, if necessary, we may assume that the input data are mean-centred, that is, ∑_{i=1}^n x_i = 0. Given a target number of dimensions k ≤ m, PCA aims to find an orthonormal family of k vectors u_1, …, u_k ∈ R^m that “explain most of the variation in the data”. More precisely, for i = 1, …, n we approximate each data point x_i by a linear expression z_{i1} u_1 + … + z_{ik} u_k for some scalars z_{i1}, …, z_{ik} ∈ R; the goal of PCA is to choose the u_j so as to optimise the quality of this approximation over all data points. The optimal such vectors u_1, …, u_k are the k principal components: u_1 is the direction of greatest variance in the data, u_2 is the direction of greatest variance orthogonal to u_1, and so on. To find the principal components we apply a matrix factorisation technique, the singular value decomposition, to the m × n matrix whose columns are the mean-centred data points x_i. In the end, representing each data point x_i ∈ R^m by its coordinates z_i = (z_{i1}, …, z_{ik}) with respect to the k principal components yields a lower-dimensional and (hopefully) more informative representation.
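The procedure just described can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation; note one assumption that differs from the text above: data points are stored as the rows of an n × m array (the NumPy convention), so the SVD is applied to the transpose of the m × n matrix described in the text, and the principal components come out as the first k rows of the factor Vᵀ.

```python
import numpy as np

def pca(X, k):
    """Project n data points (the rows of X, an n-by-m array) onto the
    top k principal components, computed via the singular value
    decomposition. Returns (components, coords): the k components as
    rows of a k-by-m array, and the n-by-k array of coordinates z_i."""
    # Mean-centre the data so that the columns of X sum to zero.
    Xc = X - X.mean(axis=0)
    # Thin SVD: Xc = U @ diag(S) @ Vt, where the rows of Vt are
    # orthonormal and S holds the singular values in decreasing order.
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:k]           # u_1, ..., u_k as rows
    coords = Xc @ components.T    # z_i = (z_i1, ..., z_ik) per row
    return components, coords
```

For example, `pca(X, 2)` on a 50 × 5 array returns a 2 × 5 array of components and a 50 × 2 array of coordinates; taking k = m reconstructs the mean-centred data exactly, since then `coords @ components` equals `Xc`.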