A Class-Information-Based Sparse Component Analysis Method to Identify Differentially Expressed Genes on RNA-Seq Data

Advances in deep sequencing technologies have generated large volumes of RNA-Seq data, and many methods based on sparse theory have been proposed to identify differentially expressed genes from these data. To improve the performance of sparse principal component analysis, we propose a novel class-information-based sparse component analysis (CISCA) method, which introduces class information via a total scatter matrix. First, CISCA normalizes the RNA-Seq data using a Poisson model to obtain their differential sections. Second, the total scatter matrix is obtained by combining the between-class and within-class scatter matrices. Third, the total scatter matrix is decomposed by singular value decomposition, and a new data matrix is constructed from the singular values and left singular vectors. Fourth, to obtain sparse components, CISCA decomposes the constructed data matrix by solving an optimization problem with sparsity constraints on the loading vectors. Finally, the differentially expressed genes are identified from the sparse loading vectors. Results on simulated and real RNA-Seq data demonstrate that the method is effective and well suited to analyzing such data.
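The pipeline described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation, and several details are assumptions that the abstract does not specify: the input is taken to be already normalized (the Poisson-model step is skipped), the scatter matrices are combined by a plain sum, the new data matrix is built as the leading left singular vectors scaled by the square roots of their singular values, and the sparse step uses a generic rank-1 soft-thresholding iteration with a relative threshold `alpha`. All function names here are hypothetical.

```python
import numpy as np


def scatter_matrices(X, y):
    """Between-class and within-class scatter of X (samples x genes)."""
    mu = X.mean(axis=0)
    p = X.shape[1]
    S_b = np.zeros((p, p))
    S_w = np.zeros((p, p))
    for c in np.unique(y):
        Xc = X[y == c]
        d = (Xc.mean(axis=0) - mu)[:, None]
        S_b += Xc.shape[0] * (d @ d.T)   # class-mean deviation, weighted by class size
        R = Xc - Xc.mean(axis=0)
        S_w += R.T @ R                   # spread of samples around their class mean
    return S_b, S_w


def soft_threshold(a, lam):
    """Shrink each entry of a toward zero by lam."""
    return np.sign(a) * np.maximum(np.abs(a) - lam, 0.0)


def sparse_loading(Y, alpha=0.3, n_iter=100, tol=1e-8):
    """Rank-1 sparse loading vector of Y via alternating soft-thresholding.

    Assumption: the threshold is alpha times the largest absolute entry,
    so with alpha < 1 at least one entry always survives.
    """
    u = np.linalg.svd(Y, full_matrices=False)[0][:, 0]
    for _ in range(n_iter):
        v = Y.T @ u
        v /= np.linalg.norm(v)
        a = Y @ v
        u_new = soft_threshold(a, alpha * np.abs(a).max())
        u_new /= np.linalg.norm(u_new)
        converged = np.linalg.norm(u_new - u) < tol
        u = u_new
        if converged:
            break
    return u


def cisca_sketch(X, y, k=3, alpha=0.3):
    """Return indices of genes with nonzero sparse loading."""
    S_b, S_w = scatter_matrices(X, y)
    S_t = S_b + S_w                      # assumption: combine by plain sum
    U, s, _ = np.linalg.svd(S_t)
    Y = U[:, :k] * np.sqrt(s[:k])        # assumption: scale columns by sqrt of singular values
    return np.flatnonzero(sparse_loading(Y, alpha))


# Toy data: 40 samples, 30 genes, two classes; genes 0-4 are shifted in class 1.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 30))
y = np.repeat([0, 1], 20)
X[y == 1, :5] += 6.0

selected = cisca_sketch(X, y)
print(selected)
```

On this toy data the shifted genes dominate the leading direction of the total scatter matrix, so the thresholding step is expected to keep them and zero out the rest. Real count data would first need the normalization step, and forming a genes-by-genes scatter matrix is only practical for modest numbers of genes.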
