Supporting online material for: LocTree2 predicts localization for all domains of life

Motivation: Subcellular localization is one aspect of protein function. Despite advances in high-throughput imaging, localization maps remain incomplete. Several methods accurately predict localization, but many challenges remain to be tackled.

Results: In this study, we introduced a framework to predict localization in life's three domains, including globular and membrane proteins (3 classes for archaea, 6 for bacteria and 18 for eukaryota). The resulting method, LocTree2, works well even for protein fragments. It uses a hierarchical system of support vector machines that imitates the cascading mechanism of cellular sorting. The method reaches high levels of sustained performance (eukaryota: Q18=65%, bacteria: Q6=84%). LocTree2 also accurately distinguishes membrane and non-membrane proteins. In our hands, it compared favorably with top methods when tested on new data.

Availability: Online through PredictProtein (predictprotein.org); as standalone version at http://www.rostlab.org/services/loctree2.

Contact: localization@rostlab.org

Supplementary Information: Supplementary data are available at Bioinformatics online.
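The hierarchical system mentioned in the Results can be illustrated with a minimal sketch. It is not the LocTree2 implementation: each node below is a plain threshold rule standing in for a trained support vector machine, and the tree shape, class labels, feature names (`signal_peptide_score`, `hydrophobic_fraction`) and thresholds are all hypothetical. The helper `q_accuracy` assumes that Q18 and Q6 denote overall accuracy over 18 and 6 classes, i.e. the percentage of proteins assigned to their observed class.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Union

Features = Dict[str, float]


@dataclass
class Node:
    """One binary decision in the hierarchy (a stand-in for a trained SVM)."""
    decide: Callable[[Features], bool]  # True -> follow `yes`, False -> follow `no`
    yes: Union["Node", str]             # subtree or leaf label
    no: Union["Node", str]              # subtree or leaf label


def predict(node: Union[Node, str], x: Features) -> str:
    """Walk from the root to a leaf; the leaf label is the predicted class."""
    while isinstance(node, Node):
        node = node.yes if node.decide(x) else node.no
    return node


def q_accuracy(predicted: List[str], observed: List[str]) -> float:
    """Percentage of proteins whose predicted class matches the observed one."""
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)


# Hypothetical three-class toy tree; features and thresholds are invented.
toy_tree = Node(
    decide=lambda x: x["signal_peptide_score"] > 0.5,
    yes="secreted",
    no=Node(
        decide=lambda x: x["hydrophobic_fraction"] > 0.4,
        yes="membrane",
        no="cytoplasm",
    ),
)

# Invented example proteins with invented observed classes.
proteins = [
    {"signal_peptide_score": 0.9, "hydrophobic_fraction": 0.2},
    {"signal_peptide_score": 0.1, "hydrophobic_fraction": 0.6},
    {"signal_peptide_score": 0.2, "hydrophobic_fraction": 0.1},
]
observed = ["secreted", "membrane", "membrane"]
predicted = [predict(toy_tree, p) for p in proteins]
```

In this layout an error at an upper node cannot be corrected further down, which is one reason per-node decisions in such a cascade need to be reliable; the third toy protein shows this, because it is routed to the non-membrane leaf although its invented observed class is membrane.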
