Learning from Heterogeneous Data Sources: An Application in Spatial Proteomics

Sub-cellular localisation of proteins is an essential post-translational regulatory mechanism that can be assayed using high-throughput mass spectrometry (MS). These MS-based spatial proteomics experiments enable us to pinpoint the sub-cellular distribution of thousands of proteins in a specific system under controlled conditions. Recent advances in high-throughput MS methods have yielded a plethora of experimental spatial proteomics data for the cell biology community. Yet there are many third-party data sources, such as immunofluorescence microscopy or protein annotations and sequences, which represent a rich and vast source of complementary information. We present a unique transfer learning classification framework that uses a nearest-neighbour or support vector machine system to integrate heterogeneous data sources and considerably improve the quantity and quality of sub-cellular protein assignment. We demonstrate the utility of our algorithms through evaluation of five experimental datasets from four different species, in conjunction with four different auxiliary data sources, to classify proteins to tens of sub-cellular compartments with high generalisation accuracy. We further apply the method to an experiment on pluripotent mouse embryonic stem cells to classify a set of previously unknown proteins, and validate our findings against a recent high-resolution map of the mouse stem cell proteome. The methodology is distributed as part of the open-source Bioconductor pRoloc suite for spatial proteomics data analysis.

Abbreviations

LOPIT: Localisation of Organelle Proteins by Isotope Tagging
PCP: Protein Correlation Profiling
ML: Machine learning
TL: Transfer learning
SVM: Support vector machine
PCA: Principal component analysis
GO: Gene Ontology
CC: Cellular compartment
iTRAQ: Isobaric tags for relative and absolute quantitation
TMT: Tandem mass tags
MS: Mass spectrometry
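To make the idea of integrating a primary (experimental MS) data source with an auxiliary one more concrete, the following is a minimal sketch of a nearest-neighbour transfer learning classifier. It is not the pRoloc API and not necessarily the authors' exact algorithm. The sketch assumes that each data source casts class votes from its own k nearest labelled proteins and that the two vote vectors are mixed with a per-class weight theta; the function names, the toy data and the weighting scheme are all hypothetical and chosen for illustration only.

```python
import numpy as np


def knn_class_votes(train_x, train_y, query_x, k, n_classes):
    """Return, for each query point, the fraction of its k nearest
    training points (Euclidean distance) that belong to each class."""
    # Pairwise distances with shape (n_query, n_train).
    dists = np.linalg.norm(query_x[:, None, :] - train_x[None, :, :], axis=2)
    nearest = np.argsort(dists, axis=1)[:, :k]
    votes = np.zeros((query_x.shape[0], n_classes))
    for c in range(n_classes):
        votes[:, c] = (train_y[nearest] == c).mean(axis=1)
    return votes


def combine_votes(primary_votes, auxiliary_votes, theta):
    """Mix the two vote matrices with per-class weights theta in [0, 1]
    (assumed scheme: theta = 1 trusts only the primary data,
    theta = 0 trusts only the auxiliary data) and pick the best class."""
    theta = np.asarray(theta, dtype=float)
    scores = theta * primary_votes + (1.0 - theta) * auxiliary_votes
    return scores.argmax(axis=1)


# Toy example: six labelled proteins, two classes, described by
# hypothetical primary (MS profile) and auxiliary (annotation) features.
labels = np.array([0, 0, 0, 1, 1, 1])
primary_train = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                          [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
auxiliary_train = np.array([[1.0, 0.0], [0.9, 0.1], [0.8, 0.0],
                            [0.0, 1.0], [0.1, 0.9], [0.0, 0.8]])

# One unlabelled protein whose two data sources disagree.
primary_query = np.array([[0.2, 0.3]])
auxiliary_query = np.array([[0.02, 0.93]])

p_votes = knn_class_votes(primary_train, labels, primary_query, k=3, n_classes=2)
a_votes = knn_class_votes(auxiliary_train, labels, auxiliary_query, k=3, n_classes=2)

primary_only = combine_votes(p_votes, a_votes, theta=[1.0, 1.0])
auxiliary_only = combine_votes(p_votes, a_votes, theta=[0.0, 0.0])
```

In this toy case the primary data place the query protein in class 0 and the auxiliary data place it in class 1, so the weights decide the outcome. In practice such weights would be tuned on labelled marker proteins, for example by cross-validation, rather than set by hand.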

