SubCons: a new ensemble method for improved human subcellular localization predictions

Motivation: Knowledge of the correct protein subcellular localization is necessary for understanding the function of a protein. Unfortunately, large-scale experimental studies are limited in their accuracy. Therefore, the development of prediction methods has been limited by the amount of accurate experimental data. Recently, however, large-scale experimental studies have provided new data that can be used to evaluate the accuracy of subcellular predictions in human cells. Using these data, we examined the performance of state-of-the-art methods and developed SubCons, an ensemble method that combines four predictors using a Random Forest classifier.

Results: SubCons outperforms earlier methods on a dataset of proteins where two independent methods confirm the subcellular localization. Given nine subcellular localizations, SubCons achieves an F1-score of 0.79, compared with 0.70 for the second-best method. Furthermore, at a false positive rate (FPR) of 1%, the true positive rate (TPR) is over 58% for SubCons, compared with less than 50% for the best individual predictor.

Availability and Implementation: SubCons is freely available as a webserver (http://subcons.bioinfo.se), and the source code is available from https://bitbucket.org/salvatore_marco/subcons-web-server. The golden dataset is also available from http://subcons.bioinfo.se/pred/download.

Contact: arne@bioinfo.se

Supplementary information: Supplementary data are available at Bioinformatics online.
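The two evaluation measures quoted in the Results (F1-score and TPR at a fixed FPR) can be illustrated with a minimal sketch. This is not the SubCons evaluation code: the function names `macro_f1` and `tpr_at_fpr`, the three-letter location labels, and the toy data are hypothetical, and the choice of an unweighted (macro) average over classes is an assumption, since the abstract does not state how the F1-score is aggregated across the nine localizations.

```python
from typing import Sequence


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Unweighted mean of per-class F1 over the classes present in y_true.

    Macro averaging is an assumption made for this sketch.
    """
    classes = sorted(set(y_true))
    per_class = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        per_class.append(2 * tp / denom if denom else 0.0)
    return sum(per_class) / len(per_class)


def tpr_at_fpr(labels: Sequence[int], scores: Sequence[float], max_fpr: float) -> float:
    """Highest TPR reachable by a score threshold whose FPR stays <= max_fpr.

    labels: 1 for proteins truly in the localization of interest, 0 otherwise.
    scores: predictor confidence for that localization (higher = more confident).
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative example")
    best = 0.0
    # Try every observed score as a threshold ("predict positive if score >= thr").
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for l, s in zip(labels, scores) if l == 1 and s >= thr)
        fp = sum(1 for l, s in zip(labels, scores) if l == 0 and s >= thr)
        if fp / n_neg <= max_fpr:
            best = max(best, tp / n_pos)
    return best


# Toy data with hypothetical location labels (not taken from the golden dataset).
toy_true = ["NUC", "NUC", "CYT", "CYT", "MIT", "MIT"]
toy_pred = ["NUC", "CYT", "CYT", "CYT", "MIT", "NUC"]
toy_labels = [1, 1, 0, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

print(macro_f1(toy_true, toy_pred))
print(tpr_at_fpr(toy_labels, toy_scores, max_fpr=0.0))
```

On a real benchmark, `max_fpr=0.01` would correspond to the 1% operating point quoted above; the toy example uses other thresholds only because it has too few negatives for a 1% FPR to be meaningful.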
