Sparse singular value decomposition-based feature extraction for identifying differentially expressed genes

Recently, feature extraction and dimensionality reduction have become fundamental tools for many data mining tasks, especially for processing high-dimensional data such as genome data. In this paper, a new feature extraction method based on sparse singular value decomposition (SSVD) is developed. The SSVD algorithm is applied to extract differentially expressed genes from two genome datasets, both from The Cancer Genome Atlas (TCGA), and the extracted genes are then evaluated with tools based on Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analysis. As a gene extraction method, SSVD is also compared with existing feature extraction methods such as independent component analysis, the p-norm robust feature extraction method, and sparse principal component analysis. The GO analysis results show that SSVD outperforms the competing algorithms. The KEGG analysis results show which of the extracted genes participate in the pathways in cancer. Together, the experiments indicate that SSVD is an effective feature selection method compared with the competing methods. The KEGG analysis results may also provide a meaningful reference for further study by professionals in biomedical science.
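The abstract does not spell out the SSVD algorithm, so the following is only a minimal sketch of the general idea, assuming a rank-one sparse SVD fitted by alternating soft-thresholding on a genes-by-samples matrix. The function names (`soft_threshold`, `sparse_rank1_svd`), the fixed penalties `lam_u` and `lam_v`, and the synthetic data are all illustrative assumptions, not the authors' implementation. Genes whose entry in the sparse left singular vector is nonzero are treated as the extracted (differentially expressed) genes.

```python
import numpy as np


def soft_threshold(x, lam):
    """Shrink each entry of x toward zero by lam; entries within [-lam, lam] become 0."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)


def sparse_rank1_svd(X, lam_u=0.0, lam_v=0.0, n_iter=100, tol=1e-6):
    """Rank-one sparse SVD sketch: X is approximated by s * outer(u, v).

    Assumption: plain soft-thresholding with fixed penalties, which is a
    simplification chosen for illustration.
    """
    # Start from the leading singular vectors of the ordinary SVD.
    U, _, Vt = np.linalg.svd(X, full_matrices=False)
    u, v = U[:, 0], Vt[0]
    for _ in range(n_iter):
        # Update the sample-side vector, then rescale it to unit length.
        v_new = soft_threshold(X.T @ u, lam_v)
        norm_v = np.linalg.norm(v_new)
        if norm_v == 0.0:
            raise ValueError("lam_v is too large: every entry of v was zeroed")
        v_new = v_new / norm_v
        # Update the gene-side vector, then rescale it to unit length.
        u_new = soft_threshold(X @ v_new, lam_u)
        norm_u = np.linalg.norm(u_new)
        if norm_u == 0.0:
            raise ValueError("lam_u is too large: every entry of u was zeroed")
        u_new = u_new / norm_u
        converged = (np.linalg.norm(u_new - u) < tol
                     and np.linalg.norm(v_new - v) < tol)
        u, v = u_new, v_new
        if converged:
            break
    s = float(u @ X @ v)
    return u, s, v


# Synthetic example (hypothetical data): 200 genes x 20 samples, where the
# first 10 genes are shifted up in samples 0-9 and down in samples 10-19.
rng = np.random.default_rng(0)
X = 0.1 * rng.standard_normal((200, 20))
X[:10, :10] += 3.0
X[:10, 10:] -= 3.0

u, s, v = sparse_rank1_svd(X, lam_u=2.0, lam_v=0.0)
selected = np.flatnonzero(u)  # indices of genes kept by the sparse vector
```

In this toy setting the penalty on `u` removes the noise-only genes and leaves the planted ones in `selected`. On real expression data the penalties would need to be tuned, for example with an information criterion or cross-validation, and the resulting gene list could then be passed to GO or KEGG enrichment tools.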
