Statistical Analysis in Proteomics

Compared to genomics or transcriptomics, proteomics is often regarded as an “emerging technology,” i.e., as not having reached the same level of maturity. While the successful implementation of proteomics workflows and technology still requires significant levels of expertise and specialization, great strides have been made to make the technology more powerful, streamlined and accessible. In 2014, two landmark studies published the first draft versions of the human proteome. We aim to provide an introduction specifically to the background of mass spectrometry (MS)-based proteomics. Within the field, mass spectrometry has emerged as a core technology. Coupled to increasingly powerful separations, data processing and bioinformatics solutions, it allows the quantitative analysis of whole proteomes within a matter of days, a timescale that has made global comparative proteome studies feasible at last. We present and discuss the basic concepts behind proteomics mass spectrometry and the accompanying topic of protein and peptide separations, with a focus on the properties of datasets emerging from such studies.
