2.4 Extensions of probabilistic PCA

PCA of large-scale datasets with many missing values

Principal component analysis (PCA) is a classical data analysis technique. Algorithms for PCA differ in how well they scale to high-dimensional problems and in their ability to handle missing values in the data. In our recent papers [16, 17], we study the case where the data are high-dimensional and a majority of the values are missing. With very sparse data, overfitting becomes a severe problem even in simple linear models such as PCA. Regularization can be provided within the Bayesian approach by introducing priors for the model parameters. The PCA model can then be identified using, for example, maximum a posteriori estimation (MAPPCA) or variational Bayesian (VBPCA) learning.

In [16, 17], we study different approaches to PCA for incomplete data. We show that faster convergence can be achieved using the following update rule for the model parameters:

$$\theta_i \leftarrow \theta_i - \gamma \left( \frac{\partial^2 C}{\partial \theta_i^2} \right)^{-\alpha} \frac{\partial C}{\partial \theta_i},$$

where α is a control parameter that allows the learning algorithm to vary from standard gradient descent (α = 0) to the diagonal Newton's method (α = 1). These learning rules can be used for standard PCA learning and extended to MAPPCA and VBPCA.

The algorithms were tested on the Netflix problem (http://www.netflixprize.com/), the task of predicting preferences (or producing personal recommendations) from other people's preferences. The Netflix data set consists of movie ratings given by 480,189 customers to 17,770 movies. There are 100,480,507 ratings from 1 to 5, and the task is to predict 2,817,131 other ratings among the same group of customers and movies; 1,408,395 of the ratings are reserved for validation. Thus, 98.8% of the values are missing. We used different variants of PCA to predict the test ratings in the Netflix data set. The obtained results are shown in Figure 2.5. The best accuracy was obtained using VB PCA with a simplified form of the posterior approximation (VBPCAd in Figure 2.5). That method was also able to provide reasonable estimates of the uncertainties of the predictions.
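To illustrate the update rule above, the following is a minimal Python/NumPy sketch of PCA for incomplete data, not the implementation used in [16, 17]. It alternates updates of the two factor matrices, applying the rule coordinate-wise to the squared reconstruction error over observed entries; the function name fit_pca_missing and the parameters gamma, alpha, and n_iter are illustrative assumptions.

# Minimal sketch (assumed names and defaults, not the authors' code) of PCA
# with missing values using the update
#     theta_i <- theta_i - gamma * (d^2C/dtheta_i^2)^(-alpha) * dC/dtheta_i,
# which interpolates between gradient descent (alpha = 0) and a diagonal
# Newton method (alpha = 1).

import numpy as np


def fit_pca_missing(Y, mask, n_components, alpha=1.0, gamma=0.5,
                    n_iter=100, eps=1e-6, seed=0):
    """Factorize Y ~ W @ X over the observed entries (mask == True)."""
    rng = np.random.default_rng(seed)
    d, n = Y.shape
    W = 0.1 * rng.standard_normal((d, n_components))
    X = 0.1 * rng.standard_normal((n_components, n))
    M = mask.astype(float)                 # 1 for observed, 0 for missing
    Y0 = np.where(mask, Y, 0.0)            # zero out missing entries

    for _ in range(n_iter):
        # Residuals on observed entries only.
        E = np.where(mask, Y0 - W @ X, 0.0)

        # Update W: gradient and diagonal second derivative w.r.t. each W_ik.
        grad_W = -E @ X.T                  # dC/dW
        hess_W = M @ (X ** 2).T + eps      # diagonal of d^2C/dW^2
        W -= gamma * hess_W ** (-alpha) * grad_W

        # Recompute residuals, then update X analogously.
        E = np.where(mask, Y0 - W @ X, 0.0)
        grad_X = -W.T @ E
        hess_X = (W ** 2).T @ M + eps
        X -= gamma * hess_X ** (-alpha) * grad_X

    return W, X


if __name__ == "__main__":
    # Toy example: low-rank data with 80% of the entries missing.
    rng = np.random.default_rng(1)
    Y_true = rng.standard_normal((50, 3)) @ rng.standard_normal((3, 200))
    mask = rng.random(Y_true.shape) < 0.2
    W, X = fit_pca_missing(Y_true, mask, n_components=3, alpha=1.0)
    rmse = np.sqrt(np.mean((Y_true - W @ X)[~mask] ** 2))
    print(f"RMSE on missing entries: {rmse:.3f}")

With alpha = 0 the sketch reduces to ordinary gradient descent with step size gamma, and with alpha = 1 each parameter is scaled by the inverse of its own second derivative, which is the diagonal Newton behaviour described above.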