High-resolution functional annotation of human transcriptome: predicting isoform functions by a novel multiple instance-based label propagation method

Alternative transcript processing is an important mechanism for generating functional diversity in genes. However, little is known about the precise functions of individual isoforms, even though proteins (translated from transcript isoforms), not genes, are the function carriers. By integrating multiple human RNA-seq data sets, we carried out the first systematic prediction of isoform functions, enabling high-resolution functional annotation of the human transcriptome. Unlike gene function prediction, isoform function prediction faces a unique challenge: the lack of training data, because all known functional annotations are at the gene level. To address this challenge, we modelled the gene–isoform relationships as multiple instance data and developed a novel label propagation method to predict functions. Our method achieved an average area under the receiver operating characteristic curve of 0.67 and assigned functions to 15 572 isoforms. Interestingly, we observed that different functions have different sensitivities to alternative isoform processing, and that the functional diversity of isoforms from the same gene is positively correlated with their tissue expression diversity. Finally, we surveyed the literature to validate our predictions for a number of apoptotic genes. Strikingly, for the famous ‘TP53’ gene, we not only accurately identified the apoptosis regulation function of its five isoforms, but also correctly predicted the precise direction of the regulation.
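
The multiple-instance framing can be illustrated with a minimal sketch. It is not the method of the paper, whose details the abstract does not give; it is generic clamped label propagation on an isoform similarity network with one added constraint. Each gene is treated as a bag of isoforms: every isoform of a negative gene is held at 0, and at least one isoform of a positive gene is held at 1. The function name `propagate_bag_labels`, its parameters, and the choice of similarity matrix `W` are all assumptions made for illustration.

```python
import numpy as np

def propagate_bag_labels(W, bags, bag_labels, n_iter=50):
    """Illustrative multiple-instance label propagation (hypothetical helper).

    W          : (n, n) symmetric, non-negative isoform similarity matrix.
    bags       : list of lists; bags[g] holds the isoform indices of gene g.
    bag_labels : dict mapping gene index -> 1 (positive) or 0 (negative);
                 genes missing from the dict are unlabeled.
    Returns one score in [0, 1] per isoform.
    """
    n = W.shape[0]
    deg = W.sum(axis=1)
    deg[deg == 0] = 1.0              # avoid division by zero for isolated nodes
    S = W / deg[:, None]             # row-normalised propagation matrix

    y = np.zeros(n)                  # fixed values for clamped isoforms
    clamp = np.zeros(n, dtype=bool)
    f = np.zeros(n)                  # current isoform scores
    for g, members in enumerate(bags):
        label = bag_labels.get(g)
        if label == 0:
            clamp[members] = True    # all isoforms of a negative gene are negative
        elif label == 1:
            f[members] = 1.0         # start every candidate isoform as positive
            if len(members) == 1:    # single-isoform gene: the label is unambiguous
                y[members] = 1.0
                clamp[members] = True

    for _ in range(n_iter):
        f = S @ f                    # each isoform averages its neighbours' scores
        f[clamp] = y[clamp]          # re-impose the known instance labels
        for g, members in enumerate(bags):
            if bag_labels.get(g) == 1:
                # bag constraint: the best-scoring isoform of a positive gene stays at 1
                top = members[int(np.argmax(f[members]))]
                f[top] = 1.0
    return f
```

In this sketch, an isoform of a positive multi-isoform gene keeps a high score only if the network supports it; an isoform whose neighbours are mostly negative drifts towards 0, so isoforms of the same gene can receive different predictions.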
