Edge‐group sparse PCA for network‐guided high dimensional data analysis

Motivation: Principal component analysis (PCA) is widely used to analyze high‐dimensional gene expression data. In this study, we propose an Edge‐group Sparse PCA (ESPCA) model that incorporates the group structure of a prior gene network into the PCA framework for dimension reduction and feature interpretation. ESPCA enforces sparsity of the principal component (PC) loadings by taking into account the connectivity of gene variables in the prior network. We developed an alternating iterative algorithm to solve ESPCA. The key step of this algorithm is solving a new k‐edge sparse projection problem, which we address with a greedy strategy. Here we apply ESPCA to analyze multiple gene expression matrices simultaneously. By incorporating prior knowledge, our method can overcome the drawbacks of sparse PCA and capture gene modules with better biological interpretations.

Results: We evaluated the performance of ESPCA using a set of artificial datasets and two real biological datasets (TCGA pan‐cancer expression data and ENCODE expression data), and compared its performance with that of PCA and sparse PCA. The results showed that ESPCA could identify more biologically relevant genes, improve their biological interpretation and reveal distinct sample characteristics.

Availability and implementation: An R package of ESPCA is available at http://page.amss.ac.cn/shihua.zhang/

Supplementary information: Supplementary data are available at Bioinformatics online.
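The alternating scheme mentioned under Motivation can be illustrated with a minimal Python sketch. It is not the authors' R implementation, and it rests on several assumptions that the abstract does not state: the function names `k_edge_projection` and `espca_sketch` are hypothetical, each edge is scored by the sum of squared entries at its two endpoints, the greedy step simply keeps the k highest-scoring edges, and the iteration is a rank-one power-style update initialized from the leading right singular vector.

```python
import numpy as np

def k_edge_projection(z, edges, k):
    """Greedy sketch of a k-edge sparse projection (assumed formulation).

    Keeps the entries of z that are endpoints of the k edges with the
    largest score z_i^2 + z_j^2, zeroes the rest, and rescales to unit norm.
    """
    z = np.asarray(z, dtype=float)
    edges = np.asarray(edges, dtype=int)
    scores = z[edges[:, 0]] ** 2 + z[edges[:, 1]] ** 2
    top = np.argsort(scores)[::-1][:k]       # indices of the k best edges
    keep = np.unique(edges[top].ravel())     # genes covered by those edges
    v = np.zeros_like(z)
    v[keep] = z[keep]
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v

def espca_sketch(X, edges, k, n_iter=100, tol=1e-8):
    """Alternating updates for one sparse loading vector (illustrative only).

    X is a samples-by-genes matrix; edges is a list of (gene_i, gene_j)
    index pairs from the prior network.
    """
    X = np.asarray(X, dtype=float)
    # Assumption: start from the leading right singular vector of X.
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    v = k_edge_projection(vt[0], edges, k)
    u = np.zeros(X.shape[0])
    for _ in range(n_iter):
        u = X @ v
        u_norm = np.linalg.norm(u)
        if u_norm == 0:
            break
        u = u / u_norm                        # update sample scores
        v_new = k_edge_projection(X.T @ u, edges, k)  # update sparse loadings
        converged = np.linalg.norm(v_new - v) < tol
        v = v_new
        if converged:
            break
    return u, v

# Toy usage: six genes, three edges, signal carried by genes 0 and 1.
rng = np.random.default_rng(0)
signal = np.outer(rng.normal(size=20), [1, 1, 0, 0, 0, 0]) * 5.0
X_demo = signal + 0.01 * rng.normal(size=(20, 6))
edges_demo = [(0, 1), (2, 3), (4, 5)]
u_demo, v_demo = espca_sketch(X_demo, edges_demo, k=1)
```

In this toy setting the nonzero loadings are restricted to genes joined by the selected edge, which is the kind of network-guided sparsity the abstract describes. A more careful greedy step would account for edges that share nodes rather than scoring each edge independently.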
