A Bayesian Mixture Modelling Approach For Spatial Proteomics

Analysis of the spatial sub-cellular distribution of proteins is vital to fully understanding context-specific protein function. Some proteins are found at a single location within a cell, but up to half of proteins may reside in multiple locations, re-localise dynamically, or reside within an unknown functional compartment. These considerations lead to uncertainty in assigning a protein to a single location. Currently, mass spectrometry (MS) based spatial proteomics relies on supervised machine learning algorithms to assign proteins to sub-cellular locations based on common gradient profiles. However, such methods fail to quantify the uncertainty associated with sub-cellular class assignment. Here we reformulate the framework on which we perform statistical analysis. We propose a Bayesian generative classifier based on Gaussian mixture models to assign proteins probabilistically to sub-cellular niches, so that each protein has a probability distribution over sub-cellular locations. Bayesian computation is performed using the expectation-maximisation (EM) algorithm as well as Markov-chain Monte-Carlo (MCMC). Our methodology allows proteome-wide uncertainty quantification, thus adding a further layer to the analysis of spatial proteomics. Our framework is flexible, allowing many different systems to be analysed, and reveals new modelling opportunities for spatial proteomics. We find that our methods perform competitively with current state-of-the-art machine learning methods, whilst simultaneously providing more information. We highlight several examples where classification based on the support vector machine is unable to draw any conclusions, while uncertainty quantification using our approach provides biologically intriguing results. To our knowledge this is the first Bayesian model of MS-based spatial proteomics data.

Author summary

Sub-cellular localisation of proteins provides insights into sub-cellular biological processes. For a protein to carry out its intended function it must be localised to the correct sub-cellular environment, whether that be an organelle, a vesicle or any other sub-cellular niche. Correct sub-cellular localisation ensures that the biochemical conditions the protein needs to carry out its molecular function are met, and that the protein is near its intended interaction partners. Therefore, mis-localisation of proteins alters cell biochemistry and can, for example, disrupt signalling pathways or inhibit the trafficking of material around the cell. The sub-cellular distribution of proteins is complicated by proteins that can reside in multiple micro-environments, or that move dynamically within the cell. Methods that predict protein sub-cellular localisation often fail to quantify the uncertainty that arises from the complex and dynamic nature of the sub-cellular environment. Here we present a Bayesian methodology to analyse protein sub-cellular localisation. We explicitly model our data and use Bayesian inference to quantify uncertainty in our predictions. We find that our method is competitive with state-of-the-art machine learning methods and additionally provides uncertainty quantification. We show that, with this additional information, we can gain deeper insights into the fundamental biochemistry of the cell.
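
To illustrate what probabilistic assignment from a Gaussian generative classifier can look like, the following is a minimal sketch and not the model described above. It makes several simplifying assumptions that do not come from the source: each sub-cellular class is a single Gaussian with diagonal covariance, parameters are plain plug-in estimates from labelled marker proteins (no priors, no EM over unlabelled proteins, no MCMC), and the two-fraction profiles and all function names (fit_components, posterior, shannon_entropy) are hypothetical toy choices. Shannon entropy is used here as one possible summary of assignment uncertainty.

```python
import math


def fit_components(markers):
    """Plug-in estimates of mean, diagonal variance and mixing weight per class.

    markers: dict mapping class name -> list of profiles (lists of floats).
    """
    total = sum(len(rows) for rows in markers.values())
    params = {}
    for label, rows in markers.items():
        n, dim = len(rows), len(rows[0])
        mean = [sum(r[d] for r in rows) / n for d in range(dim)]
        # A small variance floor keeps the density finite for very tight classes.
        var = [max(sum((r[d] - mean[d]) ** 2 for r in rows) / n, 1e-6)
               for d in range(dim)]
        params[label] = (mean, var, n / total)
    return params


def log_gaussian(x, mean, var):
    """Log density of a Gaussian with diagonal covariance."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))


def posterior(x, params):
    """Bayes' rule: P(class | x) proportional to weight * N(x | mean, var)."""
    log_joint = {k: math.log(w) + log_gaussian(x, m, v)
                 for k, (m, v, w) in params.items()}
    top = max(log_joint.values())  # subtract the max for numerical stability
    unnorm = {k: math.exp(v - top) for k, v in log_joint.items()}
    z = sum(unnorm.values())
    return {k: u / z for k, u in unnorm.items()}


def shannon_entropy(probs):
    """Entropy in nats; higher values mean a less certain assignment."""
    return -sum(p * math.log(p) for p in probs.values() if p > 0)


# Hypothetical marker profiles over two gradient fractions (toy data).
markers = {
    "ER": [[0.8, 0.2], [0.7, 0.3], [0.9, 0.1]],
    "mitochondrion": [[0.2, 0.8], [0.3, 0.7], [0.1, 0.9]],
}
params = fit_components(markers)

clear = posterior([0.75, 0.25], params)     # profile close to the ER markers
ambiguous = posterior([0.5, 0.5], params)   # profile midway between classes

for name, probs in (("clear", clear), ("ambiguous", ambiguous)):
    print(name, probs, "entropy:", shannon_entropy(probs))
```

In this toy setting a profile lying between the two classes receives roughly equal membership probabilities and a high entropy, whereas a hard classifier would still return a single label for it. A fuller treatment along the lines of the abstract would place priors on the component parameters and infer them with EM or MCMC using the unlabelled proteins as well.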
