Principal Component Analysis

Principal components analysis (PCA) is a dimensionality reduction technique that gives a compact representation of data while minimising information loss. Suppose we are given a set of data, represented as vectors in a high-dimensional space. It may be that many of the variables are correlated and that the data closely fits a lower dimensional linear manifold. In this case, PCA finds such a lower dimensional representation in terms of uncorrelated variables called principal components. PCA can also be kernelised, allowing it to be used to fit data to low-dimensional non-linear manifolds. Besides dimensionality reduction, PCA can also uncover patterns in data and lead to a potentially less noisy and more informative representation. Often one applies PCA to prepare data for further analysis, e.g., finding nearest neighbours or clustering.

In a nutshell, PCA proceeds as follows. We are given a collection of data in the form of $n$ vectors $x_1, \ldots, x_n \in \mathbb{R}^m$. By first translating the data vectors, if necessary, we may assume that the input data are mean centred, that is, $\sum_{i=1}^{n} x_i = 0$. Given a target number of dimensions $k \le m$, PCA aims to find an orthonormal family of $k$ vectors $u_1, \ldots, u_k \in \mathbb{R}^m$ that "explain most of the variation in the data". More precisely, for $i = 1, \ldots, n$ we approximate each data point $x_i$ by a linear expression $z_{i1} u_1 + \cdots + z_{ik} u_k$ for some scalars $z_{i1}, \ldots, z_{ik} \in \mathbb{R}$; the goal of PCA is to choose $u_1, \ldots, u_k$ so as to optimise the quality of this approximation over all data points, as made precise below. The optimal such vectors $u_1, \ldots, u_k$ are the $k$ principal components: $u_1$ is the direction of greatest variance in the data, $u_2$ is the direction of greatest variance orthogonal to $u_1$, and so on. To find the principal components we apply a matrix factorisation technique, the singular value decomposition, to the $m \times n$ matrix whose columns are the mean-centred data points $x_i$. In the end, representing each data point $x_i \in \mathbb{R}^m$ by its coordinates $z_i = (z_{i1}, \ldots, z_{ik})$ with respect to the $k$ principal components yields a lower dimensional and (hopefully) more informative representation.
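To make "optimise the quality of this approximation" concrete, the standard formulation (assumed here, since the passage above does not spell it out) measures approximation quality by the total squared Euclidean error:
\[
\min_{\substack{u_1, \ldots, u_k \text{ orthonormal} \\ z_{ij} \in \mathbb{R}}} \; \sum_{i=1}^{n} \Bigl\| x_i - \sum_{j=1}^{k} z_{ij} u_j \Bigr\|^2 .
\]
For any fixed orthonormal family $u_1, \ldots, u_k$, the error-minimising coefficients are the orthogonal projections $z_{ij} = u_j^\top x_i$, so the problem reduces to choosing the best $k$-dimensional subspace; for mean-centred data, minimising this reconstruction error is equivalent to maximising the variance captured by the projection.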
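As a concrete illustration, the following is a minimal NumPy sketch of this pipeline via the singular value decomposition. The function name and the one-point-per-row layout (the transpose of the $m \times n$ "columns are data points" convention used above) are choices of this example, not mandated by the text.

```python
import numpy as np

def pca(X, k):
    """Coordinates of the rows of X w.r.t. the top-k principal components.

    X : (n, m) array, one data point per row (transpose of the m x n
        "columns are data points" convention used in the text).
    k : target number of dimensions, k <= m.
    Returns Z, an (n, k) array of coordinates z_i, and U, an (m, k)
    array whose columns are the principal components u_1, ..., u_k.
    """
    Xc = X - X.mean(axis=0)   # mean-centre so the points sum to zero
    # SVD of the centred data matrix; the rows of Vt are orthonormal
    # and are ordered by decreasing singular value, i.e. by decreasing
    # variance of the data along that direction.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    U = Vt[:k].T              # the k principal components u_1, ..., u_k
    Z = Xc @ U                # coordinates z_ij = u_j^T x_i
    return Z, U

# Example: 200 noisy points lying near a 2-dimensional subspace of R^5.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) @ rng.normal(size=(2, 5)) \
    + 0.01 * rng.normal(size=(200, 5))
Z, U = pca(X, k=2)            # Z has shape (200, 2), U has shape (5, 2)
```

Because the rows of `Vt` come out sorted by decreasing singular value, taking the first $k$ of them recovers exactly the top-$k$ components; in practice one would typically reach for a library routine such as `sklearn.decomposition.PCA` rather than this hand-rolled version.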