Towards a theoretical analysis of PCA for heteroscedastic data

Principal Component Analysis (PCA) is a method for estimating a subspace given noisy samples. It is useful in a variety of problems ranging from dimensionality reduction to anomaly detection and the visualization of high dimensional data. PCA performs well in the presence of moderate noise and even with missing data, but is also sensitive to outliers. PCA is also known to have a phase transition when noise is independent and identically distributed; recovery of the subspace sharply declines at a threshold noise variance. Effective use of PCA requires a rigorous understanding of these behaviors. This paper provides a step towards an analysis of PCA for samples with heteroscedastic noise, that is, samples that have non-uniform noise variances and so are no longer identically distributed. In particular, we provide a simple asymptotic prediction of the recovery of a one-dimensional subspace from noisy heteroscedastic samples. The prediction enables: a) easy and efficient calculation of the asymptotic performance, and b) qualitative reasoning to understand how PCA is impacted by heteroscedasticity (such as outliers).

[1]  Jimeng Sun,et al.  Streaming Pattern Discovery in Multiple Time-Series , 2005, VLDB.

[2]  Heng Tao Shen,et al.  Principal Component Analysis , 2009, Encyclopedia of Biometrics.

[3]  Laura Balzano,et al.  Incremental gradient on the Grassmannian for online foreground and background separation in subsampled video , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[4]  C. Tracy,et al.  Introduction to Random Matrices , 1992, hep-th/9210073.

[5]  S. Chatterjee,et al.  Matrix estimation by Universal Singular Value Thresholding , 2012, 1212.1247.

[6]  Mark Crovella,et al.  Diagnosing network-wide traffic anomalies , 2004, SIGCOMM '04.

[7]  Raj Rao Nadakuditi,et al.  The singular values and vectors of low rank perturbations of large rectangular random matrices , 2011, J. Multivar. Anal..

[8]  Michael J. Black,et al.  Robust Principal Component Analysis for Computer Vision , 2001, ICCV.

[9]  Guangming Pan,et al.  Strong convergence of the empirical distribution of eigenvalues of sample covariance matrices with a perturbation matrix , 2010, J. Multivar. Anal..

[10]  Gregory S. Wagner,et al.  Signal detection using multi-channel seismic data , 1996 .

[11]  John Wright,et al.  Robust Principal Component Analysis: Exact Recovery of Corrupted Low-Rank Matrices via Convex Optimization , 2009, NIPS.

[12]  Iwao Kanno,et al.  Activation detection in functional MRI using subspace modeling and maximum likelihood estimation , 1999, IEEE Transactions on Medical Imaging.

[13]  Setsuo Ohsuga,et al.  INTERNATIONAL CONFERENCE ON VERY LARGE DATA BASES , 1977 .

[14]  Alan Edelman,et al.  The Polynomial Method for Random Matrices , 2008, Found. Comput. Math..

[15]  Ronen Basri,et al.  Lambertian reflectance and linear subspaces , 2001, Proceedings Eighth IEEE International Conference on Computer Vision. ICCV 2001.

[16]  Kriti Saroha,et al.  A novel dimensionality reduction method for cancer dataset using PCA and Feature Ranking , 2015, 2015 International Conference on Advances in Computing, Communications and Informatics (ICACCI).