Noise-free principal component analysis: An efficient dimension reduction technique for high dimensional molecular data

Principal component analysis (PCA) is a powerful dimension reduction technique widely used in data mining. PCA projects the data into a lower-dimensional space while preserving as much of the intrinsic information in the data as possible. A disadvantage of PCA is that each extracted principal component (PC) is a linear combination of all features, so the PCs may still be contaminated by noise in the data. To address this problem, we propose a modified version of PCA called noise-free PCA (NFPCA), in which regularization is introduced during the PC extraction step to mitigate the effect of noise. The potential of the proposed method is assessed in two important applications of high-dimensional molecular data: classification and survival prediction. Multiple publicly available real-world data sets are used for this illustration. Experimental results show that NFPCA produces more informative PCs than ordinary PCA. This is largely because NFPCA suppresses the effect of noise in the PCs more efficiently, with minimal loss of information. NFPCA is a promising alternative to existing PCA approaches, not only because its PCs are highly informative but also because of its relatively low computational cost.
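
The abstract does not state which form of regularization NFPCA uses, so the following is only an illustrative sketch of the general idea, not the authors' algorithm. It assumes a Tikhonov-style filter factor s²/(s² + λ) applied to the singular values of the centered data matrix, which damps directions with small singular values, where noise tends to dominate. The function names `pca_scores` and `filtered_pca_scores` and the parameter `lam` are hypothetical.

```python
import numpy as np


def pca_scores(X, k):
    """Ordinary PCA scores: project centered data onto the top-k PCs."""
    Xc = X - X.mean(axis=0)                      # center each feature
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :k] * s[:k]                      # scores = U_k * diag(s_k)


def filtered_pca_scores(X, k, lam):
    """Assumed regularized variant (not the paper's stated method).

    Each singular value s_i is multiplied by the Tikhonov-style filter
    factor s_i^2 / (s_i^2 + lam), with lam > 0, before forming the scores.
    """
    Xc = X - X.mean(axis=0)
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    f = s**2 / (s**2 + lam)                      # filter factors in (0, 1)
    return U[:, :k] * (f[:k] * s[:k])


# Synthetic stand-in for molecular data: 20 samples, 200 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 200))

plain = pca_scores(X, k=3)
damped = filtered_pca_scores(X, k=3, lam=50.0)
```

With this choice, a larger `lam` shrinks the weaker components more strongly while leaving dominant components nearly unchanged; how the regularization parameter would be selected in practice is not specified in the abstract.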
