Testing and Validation of Computational Methods for Mass Spectrometry

High-throughput methods based on mass spectrometry (proteomics, metabolomics, lipidomics, etc.) produce a wealth of data that cannot be analyzed without computational methods. The impact of the choice of computational method on the overall result of a biological study is often underappreciated, yet different methods can lead to very different biological findings. It is thus essential to evaluate and compare the correctness and relative performance of computational methods. Both the volume of the data and the complexity of the algorithms make unbiased comparisons challenging. This paper discusses problems and challenges in the testing and validation of computational methods. We discuss the different types of validation data (simulated and experimental) as well as the metrics used to compare methods. We also introduce a new public repository for mass spectrometric reference data sets (http://compms.org/RefData) that contains a collection of publicly available data sets for evaluating the performance of a wide range of methods.
