Learning Cellular Sorting Pathways Using Protein Interactions and Sequence Motifs

Proper subcellular localization is critical for proteins to perform their cellular functions. Proteins are transported by different cellular sorting pathways, some of which take a protein through several intermediate locations before it reaches its final destination. The pathway a protein follows is determined by carrier proteins that bind to specific sequence motifs. In this article, we present a new method that integrates protein interaction and sequence motif data to model how proteins are routed through these pathways. We use a hidden Markov model (HMM) to represent protein sorting pathways. The model determines intermediate sorting states and assigns carrier proteins and motifs to the sorting pathways. In simulation studies, we show that the method can accurately recover an underlying sorting model. Using data for yeast, we show that our model leads to accurate prediction of subcellular localization. We also show that the pathways learned by our model recover many known sorting pathways and correctly assign proteins to the pathways they use. The learned model also identifies new pathways together with their putative carriers and motifs, which may represent novel protein sorting mechanisms. Supplementary results and software implementation are available from http://murphylab.web.cmu.edu/software/2010_RECOMB_pathways/.
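As a rough illustration of the HMM idea, the sketch below uses the Viterbi algorithm to decode the most likely sequence of compartments for a single protein. Everything specific in it is an assumption made for illustration and is not taken from the article: the three state names, the feature labels (`none`, `signal_motif`, `carrier_bound`), all probabilities, and the function name `viterbi`. It does not reproduce the structure or training procedure of the model described here.

```python
import math

# Hypothetical compartments used as hidden states (illustrative only).
STATES = ["cytosol", "ER", "golgi"]

# All probabilities below are made-up toy values, not learned parameters.
START = {"cytosol": 0.8, "ER": 0.1, "golgi": 0.1}
TRANS = {
    "cytosol": {"cytosol": 0.5, "ER": 0.4, "golgi": 0.1},
    "ER": {"cytosol": 0.1, "ER": 0.4, "golgi": 0.5},
    "golgi": {"cytosol": 0.1, "ER": 0.1, "golgi": 0.8},
}
# Emissions stand in for motif / carrier-interaction evidence.
EMIT = {
    "cytosol": {"none": 0.7, "signal_motif": 0.2, "carrier_bound": 0.1},
    "ER": {"none": 0.2, "signal_motif": 0.6, "carrier_bound": 0.2},
    "golgi": {"none": 0.2, "signal_motif": 0.1, "carrier_bound": 0.7},
}


def viterbi(observations):
    """Return the most probable state path for a non-empty observation list."""
    # Work in log space to avoid underflow on long sequences.
    score = {
        s: math.log(START[s]) + math.log(EMIT[s][observations[0]])
        for s in STATES
    }
    backpointers = []
    for obs in observations[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            # Best predecessor state for arriving in state s.
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score[s] = (
                score[prev] + math.log(TRANS[prev][s]) + math.log(EMIT[s][obs])
            )
            pointers[s] = prev
        score = new_score
        backpointers.append(pointers)
    # Trace back from the best final state.
    path = [max(STATES, key=lambda s: score[s])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


if __name__ == "__main__":
    print(viterbi(["none", "signal_motif", "carrier_bound"]))
```

With these toy parameters, the observation sequence in the usage line decodes to a path that passes through an intermediate compartment before the final one, which is the kind of multi-step route the abstract describes.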
