Sparse principal component analysis in cancer research.

A critical challenge in analyzing high-dimensional data in cancer research is reducing the dimension of the data while extracting the relevant features. Sparse principal component analysis (PCA) is a statistical tool that reduces data dimension and selects important variables simultaneously. In this paper, we review several approaches to sparse PCA, including variance maximization (VM), reconstruction error minimization (REM), singular value decomposition (SVD), and probabilistic modeling (PM) approaches. A simulation study is conducted to compare standard PCA with the sparse PCA methods. An example based on a published gene signature in a lung cancer dataset illustrates the potential application of sparse PCA in cancer research.
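To make the idea of simultaneous dimension reduction and variable selection concrete, the following is a minimal sketch in the spirit of the SVD family of approaches: a rank-one approximation in which the loading vector is soft-thresholded at each iteration. It is not the implementation used in the paper. The function names (`soft_threshold`, `sparse_pc1`), the penalty value `lam`, and the synthetic data (50 samples, 200 variables, of which the first 10 share a latent factor) are all assumptions chosen for illustration.

```python
import numpy as np


def soft_threshold(x, lam):
    """Shrink each entry of x toward zero by lam; entries within lam of zero become 0."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)


def sparse_pc1(X, lam, n_iter=200, tol=1e-8):
    """Illustrative sparse first loading vector via alternating rank-one updates.

    X   : (n_samples, n_variables) data matrix
    lam : hypothetical soft-threshold penalty; larger values give sparser loadings
    """
    Xc = X - X.mean(axis=0)  # center each variable
    # Start from the ordinary (dense) first principal direction.
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    v = vt[0]
    u = Xc @ v
    u /= np.linalg.norm(u)
    for _ in range(n_iter):
        # Update loadings with a lasso-type shrinkage step.
        v_new = soft_threshold(Xc.T @ u, lam)
        if not v_new.any():
            return v_new  # penalty too large: every loading was shrunk to zero
        # Update the unit-norm score vector given the sparse loadings.
        u_new = Xc @ v_new
        u_new /= np.linalg.norm(u_new)
        converged = np.linalg.norm(u_new - u) < tol
        u, v = u_new, v_new
        if converged:
            break
    return v / np.linalg.norm(v)


# Synthetic example (assumed setup): only the first 10 variables carry signal.
rng = np.random.default_rng(0)
latent = rng.normal(size=(50, 1))
X = rng.normal(size=(50, 200))
X[:, :10] += 3.0 * latent

loadings = sparse_pc1(X, lam=8.0)
selected = np.flatnonzero(loadings)  # indices of variables with nonzero loading
print("number of selected variables:", selected.size)
```

Ordinary PCA would assign a nonzero loading to all 200 variables, whereas the thresholding step here leaves nonzero loadings only on variables strongly associated with the leading component. In practice the penalty would be chosen by a data-driven criterion rather than fixed by hand as in this sketch.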
