Systematically Differentiating Functions for Alternatively Spliced Isoforms through Integrating RNA-seq Data

Integrating large-scale functional genomic data has significantly accelerated our understanding of gene functions. However, no algorithm has been developed to differentiate functions for isoforms of the same gene using high-throughput genomic data. This is because standard supervised learning requires ‘ground-truth’ functional annotations, which are lacking at the isoform level. To address this challenge, we developed a generic framework that interrogates public RNA-seq data at the transcript level to differentiate functions for alternatively spliced isoforms. For a specific function, our algorithm identifies the ‘responsible’ isoform(s) of a gene and generates classifying models at the isoform level instead of at the gene level. Through cross-validation, we demonstrated that our algorithm is effective in assigning functions to genes, especially the ones with multiple isoforms, and robust to gene expression levels and removal of homologous gene pairs. We identified genes in the mouse whose isoforms are predicted to have disparate functionalities and experimentally validated the ‘responsible’ isoforms using data from mammary tissue. With protein structure modeling and experimental evidence, we further validated the predicted isoform functional differences for the genes Cdkn2a and Anxa6. Our generic framework is the first to predict and differentiate functions for alternatively spliced isoforms, instead of genes, using genomic data. It is extendable to any base machine learner and other species with alternatively spliced isoforms, and shifts the current gene-centered function prediction to isoform-level predictions.
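The idea of picking a 'responsible' isoform per gene and building the model at the isoform level can be illustrated with a minimal multiple-instance-style sketch. This is not the authors' implementation: the nearest-centroid scorer stands in for whatever base learner is used, and the function names (`centroid`, `score`, `pick_responsible`), the gene and isoform labels, and the toy expression vectors are all hypothetical. Each gene is treated as a bag of isoform expression profiles. Genes annotated to the function are positive bags, and all isoforms of unannotated genes are treated as negatives.

```python
from statistics import fmean


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [fmean(col) for col in zip(*vectors)]


def score(x, pos_c, neg_c):
    """Higher when x lies closer to the positive centroid than to the negative one."""
    d_pos = sum((a - b) ** 2 for a, b in zip(x, pos_c))
    d_neg = sum((a - b) ** 2 for a, b in zip(x, neg_c))
    return d_neg - d_pos


def pick_responsible(pos_genes, neg_genes, max_iter=20):
    """Iteratively select one 'responsible' isoform per positive gene.

    pos_genes / neg_genes: dict gene -> dict isoform -> expression vector.
    Returns dict gene -> name of the selected isoform.
    Assumption: a nearest-centroid model replaces the real base learner.
    """
    neg_c = centroid([v for iso in neg_genes.values() for v in iso.values()])
    # Initialization: treat every isoform of a positive gene as positive.
    pos_c = centroid([v for iso in pos_genes.values() for v in iso.values()])
    chosen = {}
    for _ in range(max_iter):
        # For each positive gene, keep only its highest-scoring isoform.
        new = {
            gene: max(iso, key=lambda name: score(iso[name], pos_c, neg_c))
            for gene, iso in pos_genes.items()
        }
        if new == chosen:  # selections are stable, so stop
            break
        chosen = new
        # Refit the positive model on the selected isoforms only.
        pos_c = centroid([pos_genes[g][name] for g, name in chosen.items()])
    return chosen


# Hypothetical toy data: two samples per isoform expression profile.
pos_genes = {
    "GeneA": {"A.1": [5.0, 5.0], "A.2": [0.0, 0.2]},
    "GeneB": {"B.1": [4.8, 5.1]},
}
neg_genes = {"GeneC": {"C.1": [0.1, 0.0], "C.2": [0.2, 0.1]}}

responsible = pick_responsible(pos_genes, neg_genes)
```

In this toy case, isoform A.2 resembles the negative profiles, so only A.1 is kept as the candidate carrier of the function for GeneA. A gene-level model would have had to assign the label to both isoforms.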
