Filtering Variables for Supervised Sparse Network Analysis

Motivation: We present a dimension-reduction method that filters out variables or features, such as genes, that are irrelevant to a downstream analysis designed to detect supervised gene networks in sparse settings. Filtering genes and transcripts before network analysis can improve interpretability for a variety of analysis methods, and it applies in settings where the downstream analysis may include sparse canonical correlation analysis.

Results: Filtering methods designed specifically for cluster and network analysis are introduced and compared by simulating modular networks with known statistical properties. The proposed method performs favorably, eliminating irrelevant features while maintaining important biological signal under a variety of signal settings. We show that the speed and accuracy of methods such as sparse canonical correlation analysis increase after filtering, which greatly improves the scalability of these approaches.

Availability: Code for the gene filtering algorithm described in this manuscript is available in the geneFiltering R package on GitHub at https://github.com/lorinmil/geneFiltering. The package provides functions to filter genes and to simulate a network system. For access to the data used in this manuscript, contact the corresponding author.

Contact: lorinmil@buffalo.edu, jcm38@buffalo.edu, fzhang8@buffalo.edu, and dlt6@buffalo.edu
