PolySTest: Robust Statistical Testing of Proteomics Data with Missing Values Improves Detection of Biologically Relevant Features

PolySTest provides a user-friendly web service for statistical testing, data browsing, and visualization, including a new method that simultaneously tests for missingness and feature abundance, thereby complementing common statistical tests by rescuing otherwise discarded data features. By combining different statistical tests, PolySTest improves robustness and confidence for simulated data, experimental ground-truth data, and biological data.

Highlights
Novel statistical test combining missingness and quantitative profiles.
Unification of different statistical tests into a PolySTest FDR provides higher robustness and confidence.
PolySTest provides higher coverage of relevant biological pathways.
User-friendly interactive web service for statistical analysis and visualization.

Statistical testing remains one of the main challenges for high-confidence detection of differentially regulated proteins or peptides in large-scale quantitative proteomics experiments by mass spectrometry. Statistical tests need to be sufficiently robust to deal with experiment-intrinsic data structures and variations, and often also with reduced feature coverage across different biological samples due to ubiquitous missing values. A robust statistical test provides accurate confidence scores for large-scale proteomics results, regardless of instrument platform, experimental protocol, and software tools. However, the multitude of possible combinations of experimental strategies, mass spectrometry techniques, and informatics methods complicates the choice of an appropriate statistical approach. We address this challenge by introducing PolySTest, a user-friendly web service for statistical testing, data browsing, and data visualization.
We introduce a new method, Miss test, that simultaneously tests for missingness and feature abundance, thereby complementing common statistical tests by rescuing otherwise discarded data features. We demonstrate that PolySTest with integrated Miss test achieves higher confidence and higher sensitivity for artificial and experimental proteomics data sets with known ground truth. Application of PolySTest to mass spectrometry-based large-scale proteomics data obtained from differentiating muscle cells resulted in the rescue of 10–20% additional proteins in the identified molecular networks relevant to muscle differentiation. We conclude that PolySTest is a valuable addition to existing tools and instrument enhancements that improve coverage and depth of large-scale proteomics experiments. A fully functional demo version of PolySTest and Miss test is available via http://computproteomics.bmb.sdu.dk/Apps/PolySTest.
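The idea of unifying several statistical tests into a single FDR per feature can be illustrated with a minimal sketch. This is not the procedure PolySTest implements, which the abstract does not specify. As generic stand-ins, the sketch applies a Bonferroni correction across tests to the smallest per-feature p-value and then a Benjamini-Hochberg adjustment across features. The function names `benjamini_hochberg` and `unified_fdr` and all input values are hypothetical.

```python
from typing import List


def benjamini_hochberg(pvalues: List[float]) -> List[float]:
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def unified_fdr(per_test_pvalues: List[List[float]]) -> List[float]:
    """Combine p-values of several tests into one FDR per feature.

    per_test_pvalues[t][f] is the p-value of test t for feature f.
    Assumption: Bonferroni across tests, then BH across features.
    """
    n_tests = len(per_test_pvalues)
    n_features = len(per_test_pvalues[0])
    combined = []
    for f in range(n_features):
        smallest = min(per_test_pvalues[t][f] for t in range(n_tests))
        # Penalize for having picked the best of n_tests p-values.
        combined.append(min(1.0, smallest * n_tests))
    return benjamini_hochberg(combined)


if __name__ == "__main__":
    # Two hypothetical tests evaluated on three features.
    pvals = [
        [0.50, 0.001, 0.04],  # e.g. an abundance-based test
        [0.20, 0.900, 0.03],  # e.g. a missingness-aware test
    ]
    print(unified_fdr(pvals))
```

Taking the minimum lets a feature be detected when any one test supports it, which is why a correction across tests is needed before the adjustment across features.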
