Ieee/acm Transactions on Computational Biology and Bioinformatics Discriminative Motif Finding for Predicting Protein Subcellular Localization

Many methods have been described to predict the subcellular location of proteins from sequence information. However, most of these methods either rely on global sequence properties or use a set of known protein targeting motifs to predict protein localization. Here, we develop and test a novel method that identifies potential targeting motifs using a discriminative approach based on hidden Markov models (discriminative HMMs). These models search for motifs that are present in a compartment but absent in other, nearby, compartments by utilizing an hierarchical structure that mimics the protein sorting mechanism. We show that both discriminative motif finding and the hierarchical structure improve localization prediction on a benchmark data set of yeast proteins. The motifs identified can be mapped to known targeting motifs and they are more conserved than the average protein sequence. Using our motif-based predictions, we can identify potential annotation errors in public databases for the location of some of the proteins. A software implementation and the data set described in this paper are available from http://murphylab.web.cmu.edu/software/2009_TCBB_motif/.

[1]  Sean R. Eddy,et al.  Maximum Discrimination Hidden Markov Models of Sequence Consensus , 1995, J. Comput. Biol..

[2]  Stefan Wiemann,et al.  High-content screening microscopy identifies novel proteins with a putative role in secretory membrane traffic. , 2004, Genome research.

[3]  Wilfred W. Li,et al.  MEME: discovering and analyzing DNA and protein sequence motifs , 2006, Nucleic Acids Res..

[4]  Paul Horton,et al.  Nucleic Acids Research Advance Access published May 21, 2007 WoLF PSORT: protein localization predictor , 2007 .

[5]  Saurabh Sinha,et al.  On counting position weight matrix matches in a sequence, with application to discriminative motif finding , 2006, ISMB.

[6]  Robert F. Murphy,et al.  Automated image analysis of protein localization in budding yeast , 2007, ISMB/ECCB.

[7]  Timothy L. Bailey,et al.  Discriminative motif discovery in DNA and protein sequences using the DEME algorithm , 2007, BMC Bioinformatics.

[8]  Paul Horton,et al.  A Probabilistic Classification System for Predicting the Cellular Localization Sites of Proteins , 1996, ISMB.

[9]  Sean R. Eddy,et al.  Profile hidden Markov models , 1998, Bioinform..

[10]  L. Holm,et al.  The Pfam protein families database , 2005, Nucleic Acids Res..

[11]  D. Goldfarb,et al.  Nucleus-vacuole junctions in Saccharomyces cerevisiae are formed through the direct interaction of Vac8p with Nvj1p. , 2000, Molecular biology of the cell.

[12]  Rolf Apweiler,et al.  InterProScan - an integration platform for the signature-recognition methods in InterPro , 2001, Bioinform..

[13]  Burkhard Rost,et al.  Sequence conserved for subcellular localization , 2002, Protein science : a publication of the Protein Society.

[14]  Renato De Mori,et al.  High-performance connected digit recognition using maximum mutual information estimation , 1994, IEEE Trans. Speech Audio Process..

[15]  Michael T. Hallett,et al.  Refining Protein Subcellular Localization , 2005, PLoS Comput. Biol..

[16]  W. Richardson,et al.  The nucleoplasmin nuclear location sequence is larger and more complex than that of SV-40 large T antigen , 1988, The Journal of cell biology.

[17]  Alex Bateman,et al.  The InterPro Database, 2003 brings increased coverage and new features , 2003, Nucleic Acids Res..

[18]  Mamoon Rashid,et al.  Support Vector Machine-based method for predicting subcellular localization of mycobacterial proteins using evolutionary information and motifs , 2007, BMC Bioinformatics.

[19]  B. Birren,et al.  Sequencing and comparison of yeast species to identify genes and regulatory elements , 2003, Nature.

[20]  Dimitri Kanevsky,et al.  An inequality for rational functions with applications to some statistical estimation problems , 1991, IEEE Trans. Inf. Theory.

[21]  B. Rost,et al.  Mimicking cellular sorting improves prediction of subcellular localization. , 2005, Journal of molecular biology.

[22]  M. Inouye,et al.  Cold‐shock induction of a family of TIP1‐related proteins associated with the membrane in Saccharomyces cerevisiae , 1995, Molecular microbiology.

[23]  S. Brunak,et al.  SHORT COMMUNICATION Identification of prokaryotic and eukaryotic signal peptides and prediction of their cleavage sites , 1997 .

[24]  C. Ball,et al.  Saccharomyces Genome Database. , 2002, Methods in enzymology.

[25]  Charles Elkan,et al.  The Value of Prior Knowledge in Discovering Motifs with MEME , 1995, ISMB.

[26]  Tim J. P. Hubbard,et al.  NestedMICA as an ab initio protein motif discovery tool , 2008, BMC Bioinformatics.

[27]  Daniel Povey,et al.  Large scale discriminative training of hidden Markov models for speech recognition , 2002, Comput. Speech Lang..

[28]  J. H. Shinn,et al.  Minimotif Miner: a tool for investigating protein function , 2006, Nature Methods.

[29]  L. Fulton,et al.  Finding Functional Features in Saccharomyces Genomes by Phylogenetic Footprinting , 2003, Science.

[30]  Satoru Miyano,et al.  Extensive feature detection of N-terminal protein sorting signals , 2002, Bioinform..

[31]  M. Kanehisa,et al.  A knowledge base for predicting protein localization sites in eukaryotic cells , 1992, Genomics.

[32]  GusfieldDan Introduction to the IEEE/ACM Transactions on Computational Biology and Bioinformatics , 2004 .

[33]  R. Aebersold,et al.  Evolution of organelle-associated protein profiling. , 2009, Journal of proteomics.

[34]  Daphne Koller,et al.  Genome-wide discovery of transcriptional modules from DNA sequence and gene expression , 2003, ISMB.

[35]  Elvira García Osuna,et al.  Large-Scale Automated Analysis of Location Patterns in Randomly Tagged 3T3 Cells , 2007, Annals of Biomedical Engineering.

[36]  D. Jans,et al.  Regulation of Nuclear Transport: Central Role in Development and Transformation? , 2005, Traffic.

[37]  Michael Q. Zhang,et al.  Mining ChIP-chip data for transcription factor and cofactor binding sites , 2005, ISMB.

[38]  Jörg Schultz,et al.  HMM Logos for visualization of protein families , 2004, BMC Bioinformatics.

[39]  Renato De Mori,et al.  High performance connected digit recognition using maximum mutual information estimation , 1991, [Proceedings] ICASSP 91: 1991 International Conference on Acoustics, Speech, and Signal Processing.

[40]  William Stafford Noble,et al.  Assessing computational tools for the discovery of transcription factor binding sites , 2005, Nature Biotechnology.

[41]  Maria Jesus Martin,et al.  The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003 , 2003, Nucleic Acids Res..

[42]  Kei-Hoi Cheung,et al.  Large-scale analysis of the yeast genome by transposon tagging and gene disruption , 1999, Nature.

[43]  P. Rosenthal,et al.  Falcipain Cysteine Proteases Require Bipartite Motifs for Trafficking to the Plasmodium falciparum Food Vacuole* , 2007, Journal of Biological Chemistry.

[44]  E. O’Shea,et al.  Global analysis of protein localization in budding yeast , 2003, Nature.

[45]  S. Brunak,et al.  Predicting subcellular localization of proteins based on their N-terminal amino acid sequence. , 2000, Journal of molecular biology.

[46]  Jun Kawai,et al.  Subcellular Localization of Mammalian Type II Membrane Proteins , 2006, Traffic.

[47]  Pamela A. Silver,et al.  Nuclear transport and cancer: from mechanism to intervention , 2004, Nature Reviews Cancer.

[48]  S Subramani,et al.  Identification of Peroxisomal Targeting Signals Located at the Carboxy Terminus of Four Peroxisomal Proteins Materials and Methods Reagents , 1988 .