Complementary feature selection from alternative splicing events and gene expression for phenotype prediction

Abstract

Motivation: A central task of bioinformatics is to develop sensitive and specific means of providing medical prognoses from biomarker patterns. Common methods for predicting phenotypes from RNA-Seq datasets use machine learning algorithms trained on gene expression. Isoforms generated by alternative splicing, however, may provide a novel and complementary set of transcripts for phenotype prediction. Because of the many alternative splicing patterns, isoforms are far more numerous than genes, which creates a feature prioritization problem for many machine learning algorithms. This study identifies the empirically optimal methods for transcript quantification, feature engineering, and filtering, using phenotype prediction accuracy as the metric. It also analyzes the complementary nature of gene and isoform data and examines the feasibility of identifying isoforms as biomarker candidates.

Results: Isoform features are complementary to gene features, providing non-redundant information and enhanced predictive power when prioritized and filtered. A univariate filtering algorithm that selects up to the N highest-ranking features for phenotype prediction is described and evaluated. Pipelines for isoform quantification are compared empirically through cross-validation prediction tests on datasets from human non-small cell lung cancer (NSCLC) patients, human patients with chronic obstructive pulmonary disease (COPD), and amyotrophic lateral sclerosis (ALS) transgenic mice, each including samples of diseased and non-diseased phenotypes.

Availability and Implementation: https://github.com/clabuzze/Phenotype-Prediction-Pipeline.git

Contact: clabuzze@iastate.edu, antoniom@bc.edu, watsondk@musc.edu, andersonpe2@cofc.edu
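The univariate filtering step mentioned in the Results (score each feature on its own, then keep up to the N highest-ranking ones) can be illustrated with a minimal sketch. The abstract does not say which ranking statistic is used, so the absolute Welch t-statistic below is an assumption chosen for illustration, and the function names `welch_t` and `select_top_n` and the toy feature names are hypothetical rather than taken from the authors' pipeline.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Absolute Welch t-statistic between two groups (assumed ranking score)."""
    diff = abs(mean(a) - mean(b))
    denom = sqrt(variance(a) / len(a) + variance(b) / len(b))
    if denom == 0.0:
        # Both groups are constant: perfectly separated if the means differ.
        return float("inf") if diff > 0 else 0.0
    return diff / denom


def select_top_n(features, labels, n):
    """Return the names of up to n features with the highest univariate score.

    features: dict mapping feature name -> list of values, one per sample
    labels:   list of 0/1 phenotype labels in the same sample order
    """
    scores = {}
    for name, values in features.items():
        diseased = [v for v, y in zip(values, labels) if y == 1]
        healthy = [v for v, y in zip(values, labels) if y == 0]
        scores[name] = welch_t(diseased, healthy)
    # Sort by descending score; break ties by name for a stable result.
    ranked = sorted(scores, key=lambda name: (-scores[name], name))
    return ranked[:n]


# Toy data (made up): three diseased samples followed by three non-diseased.
labels = [1, 1, 1, 0, 0, 0]
features = {
    "gene_A": [5.1, 4.9, 5.3, 1.0, 1.2, 0.9],      # strongly separated
    "isoform_A1": [2.0, 2.1, 1.9, 2.0, 2.2, 1.8],  # no group difference
    "isoform_B2": [3.0, 3.5, 2.8, 2.0, 2.6, 1.9],  # moderately separated
}

selected = select_top_n(features, labels, 2)
print(selected)
```

Because gene and isoform features are scored in one pool here, the selected set can mix both types. When such a filter is evaluated by cross-validation, the scores would normally be computed on the training folds only, so that held-out samples do not influence which features are kept.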
